Internet Engineering Task Force Audio-Video Transport Working Group
INTERNET-DRAFT H. Schulzrinne
draft-ietf-avt-issues-01.txt AT&T Bell Laboratories
October 20, 1993
Expires: 03/01/94
Issues in Designing a Transport Protocol for Audio and Video Conferences and
other Multiparticipant Real-Time Applications
Status of this Memo
This document is an Internet Draft. Internet Drafts are working documents
of the Internet Engineering Task Force (IETF), its Areas, and its Working
Groups. Note that other groups may also distribute working documents as
Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other documents
at any time. It is not appropriate to use Internet Drafts as reference
material or to cite them other than as a ``working draft'' or ``work in
progress.''
Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet Draft.
Distribution of this document is unlimited.
Abstract
This memorandum is a companion document to the current version
of the RTP protocol specification draft-ietf-avt-rtp-*.{txt,ps}.
It discusses aspects of transporting real-time services (such as
voice or video) over the Internet. It compares and evaluates
design alternatives for a real-time transport protocol, providing
rationales for the design decisions made for RTP. Also covered are
issues of port assignment and multicast address allocation. A
comprehensive glossary of terms related to multimedia conferencing
is provided.
This document is a product of the Audio-Video Transport working group within
the Internet Engineering Task Force. Comments are solicited and should be
addressed to the working group's mailing list at rem-conf@es.net and/or the
author(s).
Contents
1 Introduction 4
2 Goals 7
3 Services 9
3.1 Duplex or Simplex? . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Framing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Version Identification . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Conference Identification. . . . . . . . . . . . . . . . . . . . . 14
3.4.1 Demultiplexing . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.2 Aggregation  . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Media Encoding Identification. . . . . . . . . . . . . . . . . . . 16
3.5.1 Audio Encodings  . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.2 Video Encodings  . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 Playout Synchronization. . . . . . . . . . . . . . . . . . . . . . 19
3.6.1 Synchronization Methods  . . . . . . . . . . . . . . . . . . . 21
3.6.2 Detection of Synchronization Units . . . . . . . . . . . . . . 22
3.6.3 Interpretation of Synchronization Bit  . . . . . . . . . . . . 24
3.6.4 Interpretation of Timestamp  . . . . . . . . . . . . . . . . . 25
3.6.5 End-of-talkspurt indication  . . . . . . . . . . . . . . . . . 29
3.6.6 Recommendation . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Segmentation and Reassembly. . . . . . . . . . . . . . . . . . . . 30
3.8 Source Identification. . . . . . . . . . . . . . . . . . . . . . . 31
H. Schulzrinne Expires 03/01/94 [Page 2]
3.8.1 Bridges, Translators and End Systems . . . . . . . . . . . . . 31
3.8.2 Address Format Issues  . . . . . . . . . . . . . . . . . . . . 33
3.8.3 Globally unique identifiers  . . . . . . . . . . . . . . . . . 34
3.8.4 Locally unique addresses . . . . . . . . . . . . . . . . . . . 35
3.9 Energy Indication. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.10 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.11 Security and Privacy  . . . . . . . . . . . . . . . . . . . . . 39
3.11.1 Introduction  . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11.2 Confidentiality . . . . . . . . . . . . . . . . . . . . . . . 40
3.11.3 Message Integrity and Authentication  . . . . . . . . . . . . 41
3.12 Security for RTP vs. PEM  . . . . . . . . . . . . . . . . . . . 42
3.13 Quality of Service Control  . . . . . . . . . . . . . . . . . . 44
3.13.1 QOS Measures  . . . . . . . . . . . . . . . . . . . . . . . . 44
3.13.2 Remote measurements . . . . . . . . . . . . . . . . . . . . . 45
3.13.3 Monitoring by Third Party . . . . . . . . . . . . . . . . . . 46
4 Conference Control Protocol 46
5 The Use of Profiles 46
6 Port Assignment 47
7 Multicast Address Allocation 48
7.1 Channel Sensing. . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Global Reservation Channel with Scoping. . . . . . . . . . . . . . 50
7.3 Local Reservation Channel. . . . . . . . . . . . . . . . . . . . . 50
7.3.1 Hierarchical Allocation with Servers . . . . . . . . . . . . . 51
7.3.2 Distributed Hierarchical Allocation  . . . . . . . . . . . . . 51
7.4 Restricting Scope by Limiting Time-to-Live . . . . . . . . . . . . 52
8 Security Considerations 52
A Glossary 52
B Address of Author 62
1 Introduction
The transport protocol for real-time applications (RTP) discussed in the
protocol specification accompanying this memorandum aims to provide
services commonly required by interactive
multimedia conferences, such as playout synchronization, demultiplexing,
media identification and active-party identification. However, RTP is not
restricted to multimedia conferences; it is anticipated that other real-time
services such as remote data acquisition and control may find its services
of use.
In this context, a conference describes associations that are characterized
by the participation of two or more agents, interacting in real time
with one or more media of potentially different types. The agents are
anticipated to be human, but may also be measurement devices, remote media
servers, simulators and the like. Both two-party and multiple-party
associations are to be supported, where one or more agents can take active
roles, i.e., generate data. Thus, applications not commonly considered a
conference fall under this wider definition, for example, one-way media such
as the network equivalent of closed-circuit television or radio, traditional
two-party telephone conversations or real-time distributed simulations.
Even though intended for real-time interactive applications, the use of
RTP for the storage and transmission of recorded real-time data should be
possible, with the understanding that the interpretation of some fields such
as timestamps may be affected by this off-line mode of operation.
RTP uses the services of an end-to-end transport protocol such as UDP,
TCP, OSI TP1 or TP4, ST-II or the like(1). The services used are:
end-to-end delivery, framing, demultiplexing and multicast. The underlying
network is not assumed to be reliable and can be expected to lose, corrupt,
arbitrarily delay and reorder packets. However, the use of RTP within
------------------------------
1. ST-II is not properly a transport protocol, as it is visible to
intermediate nodes, but it provides services such as process demultiplexing
commonly associated with transport protocols.
quality-of-service (e.g., rate) controlled networks is anticipated to be of
particular interest. Network layer support for multicasting is desirable,
but not required. RTP is supported by a real-time control protocol (RTCP)
in a relationship similar to that between IP and ICMP. However, RTP can be
used, with reduced functionality, without a control protocol. The control
protocol RTCP provides minimum functionality for maintaining conference
state for one or more flows within a single transport association. RTCP
is not guaranteed to be reliable; each participant simply sends the local
information periodically to all other conference participants.
As an alternative, RTP could be used as a transport protocol layered
directly on top of IP, potentially increasing performance and reducing
header overhead. This may be attractive as the services provided by UDP,
checksumming and demultiplexing, may not be needed for multicast real-time
conferencing applications. This aspect remains for further study. The
relationships between RTP and RTCP to other protocols of the Internet
protocol suite are depicted in Fig. 1.
+--------------------------+-----------------------------+
| | conference controller |
| media application |-------------------+ |
| | conf. ctl. prot. | |
+--------------------------+-------------------+---------+
| | RTCP | |
| +-------------------+ |
| RTP |
+--------+-----------------+ |
| | UDP | |
| ST-II +-----------------+-------------+ |
| | IP | |
+--------------------------------------------------------+
| AAL5 |
+--------------------------------------------------------+
Figure 1: Embedding of RTP and RTCP in Internet protocol stack
Conferences encompassing several media are managed by a (reliable)
conference control protocol, whose definition is outside the scope of this
note. Some aspects of its functionality, however, are described in
Section 4.
Within this working group, some common encoding rules and algorithms for
media have been specified, keeping in mind that this aspect is largely
independent of the remainder of the protocol. Without this specification,
interoperability cannot be achieved. It is intended, however, to keep
the two aspects as separate RFCs as changes in media encoding should
be independent of the transport aspects. The encoding specification
includes issues such as byte order for multi-byte samples, sample order
for multi-channel audio, the format of state information for differential
encodings, the segmentation of encoded video frames into packets, and the
like.
When used for multimedia services, RTP sources will have to be able to
convey the type of media encoding used to the receivers. The number
of encodings potentially used is rather large, but a single application
will likely restrict itself to a small subset of that. To allow the
participants in conferences to unambiguously communicate to each other the
current encoding, the working group is defining a set of encoding names to
be registered with the Internet Assigned Numbers Authority (IANA). Also,
short integers for a default mapping of common encodings are specified.
The issue of port assignment will be discussed in more detail in Section 6.
It should be emphasized, however, that UDP port assignment does not imply
that all underlying transport mechanisms share this or a similar port
mechanism.
This memorandum aims to summarize some of the discussions held within the
audio-video transport (AVT) working group chaired by Stephen Casner, but
the opinions are the author's own. Where possible, references to previous
work are included, but the author realizes that the attribution of ideas is
far from complete. The memorandum builds on operational experience with
Van Jacobson's and Steve McCanne's vat audio conferencing tool as well as
implementation experience with the author's Nevot network voice terminal.
This note will frequently refer to NVP [1], the network voice protocol,
a protocol used in two versions for early Internet wide-area packet voice
experiments. CCITT has standardized as recommendations G.764 and G.765
a packet voice protocol stack for use in digital circuit multiplication
equipment.
The name RTP was chosen to reflect the fact that audio and video
conferences may not be the only applications employing its services, while
the real-time nature of the protocol is important, setting it apart from
other multimedia-transport mechanisms, such as the MIME multimedia mail
effort [2].
The remainder of this memorandum is organized as follows. Section 2
summarizes the design goals of this real-time transport protocol. Then,
Section 3 describes the services to be provided in more detail. Section 4
briefly outlines some of the services added by a higher-layer conference
control protocol; a more detailed description is outside the scope of
this document. Two appendices discuss the issues of port assignment and
multicast address allocation, respectively. A glossary defines terms and
acronyms, providing references for further detail. The actual protocol
specification embodying the recommendation and conclusions of this report is
contained in a separate document.
2 Goals
Design decisions should be measured against the following goals, not
necessarily listed in order of importance:
content flexibility: While the primary applications that motivate the
protocol design are conference voice and video, it should be
anticipated that other applications may also find the services
provided by the protocol useful. Some examples include distribution
audio/video (for example, the ``Radio Free Ethernet'' application by
Sun), distributed simulation and some forms of (loss-tolerant) remote
data acquisition (for example, active badge systems [3,4]). Note that
it is possible that the same packet header field may be interpreted in
different ways depending on the content (e.g., a synchronization bit
may be used to indicate the beginning of a talkspurt for audio and the
beginning of a frame for video). Also, new formats of established
media, for example, high-quality multi-channel audio or combined audio
and video sources, should be anticipated where possible.
extensible: Researchers and implementors within the Internet community are
currently only beginning to explore real-time multimedia services such
as video conferences. Thus, the RTP should be able to incorporate
additional services as operational experience with the protocol
accumulates and as applications not originally anticipated find its
services useful. The same mechanisms should also allow experimental
applications to exchange application-specific information without
jeopardizing interoperability with other applications. Extensibility
is also desirable as it will hopefully speed along the standardization
effort, making the consequences of leaving out some group's favorite
fixed header field less drastic.
It should be understood that extensibility and flexibility may conflict
with the goals of bandwidth and processing efficiency.
independent of lower-layer protocols: RTP should make as few assumptions
about the underlying transport protocol as possible. It should, for
example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and
experimental protocols, for example, protocols that support resource
reservation and quality-of-service guarantees. Naturally, not all
transport protocols are equally suited for real-time services; in
particular, TCP may introduce unacceptable delays over anything but
low-error-rate LANs. Also, protocols that deliver streams rather than
packets need additional framing services as discussed in Section 3.2.
It remains to be discussed whether RTP may use services provided by the
lower-layer protocols for its own purposes (time stamps and sequence
numbers, for example).
The goal of independence from lower-layer considerations also affects
the issue of address representation. In particular, anything too
closely tied to the current IP 4-byte addresses may face early
obsolescence. It is to be anticipated, however, that experience gained
will suggest a new protocol revision in any event by that time.
bridge-compatible: Operational experience has shown that RTP-level bridges
are necessary and desirable for a number of reasons. First, it
may be desirable to aggregate several media streams into a single
stream and then retransmit it with possibly different encoding, packet
size or transport protocol. A packet ``translator'' that achieves
multicasting by user-level copying may be needed where multicast
tunnels or IP connectivity are unavailable or the end-systems are not
multicast-capable.
bandwidth efficient: It is anticipated that the protocol will be used in
networks with a wide range of bandwidths and with a variety of media
encodings. Despite increasing bandwidths within the national backbone
networks, bandwidth efficiency will continue to be important for
transporting conferences across 56 kb links, office-to-home high-speed
modem connections and international links. To minimize end-to-end
delay and the effect of lost packets, packetization intervals have to
be limited, which, in combination with efficient media encodings, leads
to short packet sizes. Generally, packets containing 16 to 32 ms of
speech are considered optimal [5--7]. For example, even with a 65 ms
packetization interval, a 4800 b/s encoding produces 39 byte packets.
Current Internet voice experiments use packets containing around 20 ms
of audio, which translates into 160 bytes of audio information coded
at 64 kb/s. Video packets are typically much longer, so that header
overhead is less of a concern.
For UDP multicast (without counting the overhead of source routing as
currently used in tunnels or a separate IP encapsulation as planned),
IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead,
to which datalink layer headers of at least 4 bytes must be added.
With RTP header lengths between 4 and 8 bytes, the total overhead
amounts to between 36 and 40 (or more) bytes per audio or video packet.
For 160-byte audio packets, the overhead of 8-byte RTP headers together
with UDP, IP and PPP (as an example of a datalink protocol) headers is
25%. For low bitrate coding, packet headers can easily double the
necessary bit rate.
Thus, it appears that any fixed headers beyond eight bytes would have
to make a significant contribution to the protocol's capabilities as
such long headers could stand in the way of running RTP applications
over low-speed links. The current fixed header lengths for NVP and
vat are 4 and 8 bytes, respectively. It is interesting to note that
G.764 has a total header overhead, including the LAPD data link layer,
of only 8 bytes, as the voice transport is considered a network-layer
protocol. The overhead is split evenly between layers 2 and 3.
Bandwidth efficiency can be achieved by transporting non-essential or
slowly changing protocol state in optional fields or in a separate
low-bandwidth control protocol. Also, header compression [8] may be
used.
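The header-overhead arithmetic quoted above can be verified numerically.
The following sketch uses the header sizes stated in the text and assumes
4 bytes for the PPP datalink header (the minimum figure given); it is an
illustration, not part of the protocol:

```python
# Worked example of the header-overhead arithmetic in the text.
# Header sizes: IPv4 20 bytes, UDP 8, RTP 8, PPP assumed 4 (minimum).

def payload_bytes(bitrate_bps, interval_ms):
    """Payload produced by one packetization interval, in bytes."""
    return int(bitrate_bps * interval_ms / 1000 / 8)

IP, UDP, RTP, PPP = 20, 8, 8, 4
headers = IP + UDP + RTP + PPP        # 40 bytes of headers per packet

# 64 kb/s audio in 20 ms packets -> 160 payload bytes, 25% overhead
pcm = payload_bytes(64_000, 20)
print(pcm, headers / pcm)             # 160 0.25

# 4800 b/s coding in 65 ms packets -> 39 payload bytes; headers now
# roughly double the necessary bit rate
lbr = payload_bytes(4800, 65)
print(lbr, headers / lbr)
```

This reproduces the 25% overhead figure for 160-byte audio packets and
the 39-byte packets cited for 4800 b/s coding.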
international: Even now, audio and video conferencing tools are used far
beyond the North American continent. It would seem appropriate to give
considerations to internationalization concerns, for example to allow
for the European A-law audio companding and non-US-ASCII character sets
in textual data such as site identification.
processing efficient: With arrival rates of on the order of 40 to 50
packets per second for a single voice or video source, per-packet
processing overhead may become a concern, particularly if the
protocol is to be implemented on other than high-end workstations.
Multiplication and division operations should be avoided where possible
and fields should be aligned to their natural size, i.e., an n-byte
integer is aligned on an n-byte multiple, where possible.
implementable now: Given the anticipated lifetime and experimental nature
of the protocol, it must be implementable with current hardware and
operating systems. That does not preclude that hardware and operating
systems geared towards real-time services may improve the performance
or capabilities of the protocol, e.g., allow better intermedia
synchronization.
3 Services
The services that may be provided by RTP are summarized below. Note that
not all services have to be offered. Services anticipated to be optional
are marked with an asterisk.
o framing (*)
o demultiplexing by conference/association (*)
o demultiplexing by media source
o demultiplexing by conference
o determination of media encoding
o playout synchronization between a source and a set of destinations
o error detection (*)
o encryption (*)
o quality-of-service monitoring (*)
In the following sections, we will discuss how these services are reflected
in the proposed packet header. Information to be conveyed within the
conference can be roughly divided into information that changes with every
data packet and other information that stays constant for longer time
periods. State information that does not change with every packet can be
carried in several different ways:
as a fixed part of the RTP header: This method is easiest to decode and
ensures state synchronization between sender and receiver(s), but can
be bandwidth inefficient or restrict the amount of state information to
be conveyed.
as a header option: The information is only carried when needed. It
requires more processing by the sending and receiving application. If
contained in every packet, it is also less bandwidth-efficient than the
first method.
within RTCP packets: This approach is roughly equivalent to header options
in terms of processing and bandwidth efficiency. Some means of
identifying when a particular option takes effect within the data
stream may have to be provided.
within a multicast conference announcement: Instead of residing at a well-
known conference server, information about on-going or upcoming
conferences may be multicast to a well-known multicast address.
within conference control: The state information is conveyed when the
conference is established or when the information changes. As for RTCP
packets, a synchronization mechanism between data and control may be
required for certain information.
through a conference directory: This is a variant of the conference control
mechanism, with a (distributed) directory at a well-known (multicast)
address maintaining state information about on-going or scheduled
conferences. Changing state information during a conference is
probably more difficult than with conference control as participants
need to be told to look at the directory for changed information.
Thus, a directory is probably best suited to hold information that will
persist through the life of the conference, for example, its multicast
group, list of media encodings, title and organizer.
The first two methods are examples of in-band signaling, the others of
out-of-band signaling.
Options can be encoded in a number of ways, resulting in different tradeoffs
between flexibility, processing overhead and space requirements. In
general, options consist of a type field, possibly a length field, and
the actual option value. The length field can be omitted if the length
is implied by the option type. Implied-length options save space, but
require special treatment while processing. While options with explicit
length that are added in later protocol versions are backwards-compatible
(the receiver can just skip them), implied-length options cannot be added
without modifying all receivers, unless they are marked as such and all have
a known length. As an example, IP defines two implied-length options, no-op
and end-of-option, both with a length of one octet. Both CLNP and IP follow
the type-length-data model, with different substructure of the type field.
For indicating the extent of options, a number of alternatives have been
suggested.
option length: The fixed header contains a field containing the length of
the options, as used for IP. This makes skipping over options easy, but
consumes precious header space.
end-of-options bit: Each option contains a special bit that is set only for
the last option in the list. In addition, the fixed header contains
a flag indicating that options are present. This conserves space
in the fixed header, at the expense of reducing usable space within
options, e.g., reducing the number of possible option types or the
maximum option length. It also makes skipping options somewhat more
processing-intensive, particularly if some options have implied lengths
and others have explicit lengths. Skipping through the options list
can be accelerated slightly by starting options with a length field.
end-of-options option: A special option type indicates the end of the
option list, with a bit in the fixed header indicating the presence of
options. The properties of this approach are similar to the previous
one, except that it can be expected to take up more header space.
options directory: An options-present bit in the fixed header indicates
the presence of an options directory. The options directory in
turn contains a length field for the options list and possibly bits
indicating the presence of certain options or option classes. The
option length makes skipping options fast, while the presence bits
allow a quick decision whether the options list should be scanned for
relevant options. If all options have a known, fixed length, the bit
mask can be used to directly access certain options, without having
to traverse parts of the options list. The drawback is increased
header space and the necessity to create the directory. If options are
explicitly coded in the bit mask, the type, number and numbering of
options is restricted. This approach is used by PIP [9].
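The type-length-value walk underlying these alternatives can be sketched
as follows. The field layout here is hypothetical (one type octet, one
length octet counting the whole option), chosen only to illustrate why
explicit-length options can be skipped while unknown implied-length
options stall the parse:

```python
# Hypothetical TLV option walk: one type octet, one length octet
# (covering type, length and value), then the value. A receiver can
# skip unknown explicit-length options; an unknown implied-length
# option would leave it unable to find the next option boundary.

END_OF_OPTIONS = 0        # implied-length, one octet, like IP's end-of-option

def walk_options(buf):
    """Yield (type, value) pairs from an option list."""
    i = 0
    while i < len(buf):
        typ = buf[i]
        if typ == END_OF_OPTIONS:          # implied length of one octet
            return
        length = buf[i + 1]                # explicit length, >= 2
        if length < 2 or i + length > len(buf):
            raise ValueError("malformed option")
        yield typ, buf[i + 2:i + length]
        i += length

opts = bytes([5, 4, 0xAB, 0xCD,            # option type 5, two value bytes
              7, 2,                        # option type 7, empty value
              0])                          # end-of-options
print(list(walk_options(opts)))            # [(5, b'\xab\xcd'), (7, b'')]
```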
3.1 Duplex or Simplex?
In terms of information flow, protocols can be roughly divided into three
categories:
1. For one instance of a protocol, packets travel only in one direction;
i.e., the receiver has no way to directly influence the sender. UDP is
an example of such a protocol.
2. While data only travels in one direction, the receiver can send back
control packets, for example, to accept or reject a connection, or
request retransmission. ST-II in its standard simplex mode is an
example; TCP is symmetric (see next item), but during a file transfer,
it typically operates in this mode, where one side sends data and the
receiver of the data returns acknowledgements.
3. The protocol is fully symmetric during the data transfer phase, with
user data and control information travelling in both directions. TCP
is a symmetric protocol.
Note that bidirectional data flow can usually be simulated by two or more
one-directional data flows in opposite directions, however, if the data
sinks need to transmit control information to the source, a decoupled stream
in the reverse direction will not do without additional machinery to bridge
the gap between the two protocol state machines.
For most of the anticipated applications for a real-time transport
protocol, one-directional data flow appears sufficient. Also, in general,
bidirectional flows may be difficult to maintain in one-to-many settings
commonly found in conferences. Real-time requirements combined with
network latency make achieving reliability through retransmission difficult,
eliminating another reason for a bidirectional communication channel. Thus,
we will focus only on control flow from the receiver of a data flow to its
sender. For brevity, we will refer to packets of this control flow as
reverse control packets.
There are at least two areas within multimedia conferences where a receiver
needs to communicate control information back to the source. First, the
sender may want or need to know how well the transmission is proceeding,
as traditional feedback through acknowledgements is missing (and usually
infeasible due to acknowledgment implosion). Secondly, the receiver should
be able to request a selective update of its state, for example, to obtain
missing image blocks after joining an on-going conference. Note that for
both uses, unicast rather than multicast is appropriate.
Three approaches allowing the sender to distinguish reverse control packets
from data packets are compared here:
sender port equals reverse port, marked packet: The same port number is
used both for data and return control messages. Packets then have to
be marked to allow distinguishing the two. Either the presence of
certain options would indicate a reverse control packet, or the options
themselves would be interpreted as reverse control information, with
the rest of the packet treated as regular data. The latter approach
appears to be the most flexible and symmetric, and is similar in
spirit to transport protocols with piggy-backed acknowledgements as in
TCP. Also, since several conferences with different multicast addresses
may be using the same port number, the receiver has to include the
multicast address in its reverse control messages. As a final
identification, the control packets have to bear the flow identifier
they belong to. The scheme has the grave disadvantage that every
application on a host has to receive the reverse control messages and
decide whether it involves a flow it is responsible for.
single reverse port: Reverse control packets for all flows use a single
port that differs from the data port. Since the type of the packet
(control vs. data) is identified by the port number, only the
multicast address and flow number still needs to be included, without a
need for a distinguishing packet format. Adding a port means that port
negotiation is somewhat more complicated; also, as in the first scheme,
the application still has to demultiplex incoming control messages.
different reverse port for each flow: This method requires that each source
makes it known to all receivers on which port it wishes to receive
reverse control messages. Demultiplexing based on flow and multicast
address is no longer necessary. However, each participant sending
data and expecting return control messages has to communicate the port
number to all other participants. Since the reverse control port
number should remain constant throughout the conference (except after
application restarts), a periodic dissemination of that information is
sufficient. Distributing the port information has the advantage that
it gives applications the flexibility to designate only certain flows
as potential recipients of reverse control information.
Unfortunately, the delay in acquiring the reverse control port number
when joining an on-going conference may make one of the more
interesting uses of a reverse control channel difficult to implement,
namely the request by a new arrival to the sender to transmit the
complete current state (e.g., image) rather than changes only.
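The demultiplexing burden of the first two schemes can be sketched as
follows. The data structures are hypothetical; the point is that with a
shared control port, every application must filter incoming control
packets by multicast address and flow identifier itself:

```python
# Hypothetical sketch of reverse-control demultiplexing under the
# "single reverse port" scheme: all control packets for all flows arrive
# on one port, so each application keeps a table keyed by (multicast
# address, flow id) and discards packets for flows it does not own.

flows_owned = {
    ("224.2.0.1", 3),     # (multicast group, flow id) pairs this
    ("224.2.0.1", 7),     # application sends data for
}

def demux_control(mcast_addr, flow_id, payload):
    """Return payload if this application owns the flow, else None."""
    if (mcast_addr, flow_id) in flows_owned:
        return payload
    return None           # some other application's flow; ignore it

print(demux_control("224.2.0.1", 3, b"QOS report"))   # b'QOS report'
print(demux_control("224.2.0.1", 4, b"QOS report"))   # None
```

Under the third scheme, the per-flow reverse port replaces this lookup,
at the cost of distributing the port numbers to all participants.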
3.2 Framing
To satisfy the goal of transport independence, we cannot assume that the
lower layer provides framing. (Consider TCP as an example; it would
probably not be used for real-time applications except possibly on a local
network, but it may be useful in distributing recorded audio or video
segments.) It may also be desirable to pack several RTPDUs into a single
TPDU.
The obvious solution is to provide for an optional message length prefixed
to the actual packet. If the underlying protocol does not provide message
delineation, both sender and receiver would know to use the message length.
If used to carry multiple RTPDUs, all participants would have to arrive
at a mutual agreement as to its use. A 16-bit field should cover most
needs, but appears to break the 4-byte alignment for the rest of the header.
However, an application would read the message length first and then copy
the appropriate number of bytes into a buffer, suitably aligned.
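As a sketch, the length-prefix framing described above might look as follows
in modern notation; the function names and the choice of a 16-bit
network-byte-order length field are illustrative assumptions, not part of any
specification.

```python
import struct

def frame(rtpdus):
    """Concatenate RTPDUs into one TPDU, each prefixed with a
    16-bit network-byte-order length field."""
    out = b""
    for pdu in rtpdus:
        out += struct.pack("!H", len(pdu)) + pdu
    return out

def deframe(tpdu):
    """Recover the individual RTPDUs from a length-framed TPDU."""
    pdus, offset = [], 0
    while offset < len(tpdu):
        (length,) = struct.unpack_from("!H", tpdu, offset)
        offset += 2
        pdus.append(tpdu[offset:offset + length])
        offset += length
    return pdus
```

A receiver copying each RTPDU into its own, suitably aligned buffer after
reading the length field sidesteps the alignment concern mentioned above.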
3.3 Version Identification
Humility suggests that we anticipate that we may not get the first iteration
of the protocol right. In order to avoid ``flag days'' where everybody
shifts to a new protocol, a version identifier could ensure continued
interoperability. Alternatively, a new port could be used, as long as only
one port (or at most a few ports) is used for all media. The difficulty in
interworking between the current vat and NVP protocols further affirms the
desirability of a version identifier. However, the version identifier can
be anticipated to be the most static of all proposed header fields. Since
the length of the header and the location and meaning of the option length
field may be affected by a version change, encoding the version within an
optional field is not feasible.
Putting the version number into the control protocol packets would make RTCP
mandatory and would make rapid scanning of conferences significantly more
difficult.
vat currently offers a 2-bit version field, while this capability is missing
from NVP. Given the low bit cost of version fields and their utility in other
contexts (IP, ST-II), it may be prudent to include a version identifier. To be useful,
any version field must be placed at the very beginning of the header.
Assigning an initial version value of one to RTP allows interoperability
with the current vat protocol.
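As an illustration of why the version field must lead the header, a receiver
can branch on the version before parsing anything else. A minimal sketch
follows; the placement of the 2-bit field in the two most significant bits of
the first octet is an assumption for illustration only.

```python
def header_version(packet):
    """Return the protocol version, assumed here to occupy the two
    most significant bits of the first header octet."""
    return (packet[0] >> 6) & 0x3

def parse(packet):
    """Branch on the version before interpreting any other field."""
    v = header_version(packet)
    if v == 1:
        return "parse as RTP version 1"
    return "unknown version %d" % v
```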
3.4 Conference Identification
A conference identifier (conference ID) could serve two mutually exclusive
functions: providing another level of demultiplexing or a means of
logically aggregating flows with different network addresses and port
numbers. vat specifies a 16-bit conference identifier.
3.4.1 Demultiplexing
Demultiplexing by RTP allows one association characterized by destination
address and port number to carry several distinct conferences. However,
this appears to be necessary only if the number of conferences exceeds the
demultiplexing capability available through (multicast) addresses and port
numbers.
Efficiency arguments suggest that combining several conferences or media
within a single multicast group is not desirable. Combining several
conferences or media within a single multicast address reduces the bandwidth
efficiency afforded by multicasting if the sets of destinations are
different. Also, applications that are not interested in a particular
conference or not capable of dealing with a particular medium are still forced
to handle the packets delivered for that conference or medium. Consider as an
example two separate applications, one for audio, one for video. If both
share the same multicast address and port, being differentiated only by the
conference identifier, the operating system has to copy each incoming audio
and video packet into two application buffers and perform a context switch
to both applications, only to have one immediately discard the incoming
packet.
Given that application-layer demultiplexing has strong negative efficiency
implications and given that multicast addresses are not an extremely
scarce commodity, there seems to be no reason to burden every application
with maintaining and checking conference identifiers for the purpose of
demultiplexing. However, if this protocol is to be used as a transport
protocol, demultiplexing capability is required.
It is also not recommended to use a conference identifier to distinguish
between different encodings, as it would be difficult for the application
to decide whether a new conference identifier means that a new conference
has arrived or simply all participants should be moved to the new conference
with a different encoding. Since the encoding may change for some but
not all participants, we could find ourselves breaking a single logical
conference into several pieces, with a fairly elaborate control mechanism to
decide which conferences logically belong together.
3.4.2 Aggregation
Particularly within a network with a wide range of capacities, using
different multicast groups for each media component of a conference makes it
possible to tailor the media distribution to the network bandwidths and end-system
capabilities. It appears useful, however, to have a means of identifying
groups that logically belong together, for example for purposes of time
synchronization.
A conference identifier used in this manner would have to be globally
unique. It appears that such logical connections would better be identified
as part of the higher-layer control protocol by identifying all multicast
addresses belonging to the same logical conference, thereby avoiding the
assignment of globally unique identifiers.
3.5 Media Encoding Identification
This field plays a similar role to the protocol field in data link
or network protocols, indicating the next higher layer (here, the media
decoder) that the data is meant for. For RTP, this field would indicate the
audio or video or other media encoding. In general, the number of distinct
encodings should be kept as small as possible to increase the chance that
applications can interoperate. A new encoding should only be recognized
if it significantly enhances the range of media quality or the types of
networks conferences can be conducted over. The unnecessary proliferation
of encodings can be reduced by making reference implementations of standard
encoders and decoders widely available.
It should be noted that encodings may not be enumerable as easily as, say,
transport protocols. A particular family of related encoding methods may
be described by a set of parameters, as discussed below in the sections on
audio and video encoding.
Encodings may change during the duration of a conference. This may be
due to changed network conditions, changed user preference or because the
conference is joined by a new participant that cannot decode the current
encoding. If the information necessary for the decoder is conveyed
out-of-band, some means of indicating when the change is effective needs to
be incorporated. Also, the indication that the encoding is about to change
must reach all receivers reliably before the first packet employing the new
encoding. Each receiver needs to track pending changes of encodings and
check for every incoming packet whether an encoding change is to take effect
with this packet.
Conveying media encodings rapidly is also important to allow scanning of
conferences or broadcast media. Note that it is not necessary to convey
the whole encoder description, with all parameters; an index into a table of
well-known encodings is probably preferable. An index would also make it
easier to detect whether the encoding has changed.
Alternatively, a directory or announcement service could provide encoding
information for on-going conferences, without carrying the information in
every packet. This may not be sufficient, however, unless all participants
within a conference use the same encoding. As soon as the encoding
information is separated from the media data, a synchronization mechanism
has to be devised that ensures that sender and receiver interpret the data
in the same manner after the out-of-band information has been updated.
There are at least two approaches to indicating media encoding, either
in-band or out-of-band:
conference-specific: Here, the media identifier is an index into a table
designating the approved or anticipated encodings (together with any
particular version numbers or other parameters) for a particular
conference or user community. The table can be distributed
through RTCP, a higher-layer conference control protocol, a conference
announcement service or some other out-of-band means. Since the number
of encodings used during a single conference is likely to be small, the
field width in the header can likewise be small. Also, there is no
need to agree on an Internet-wide list of encodings. It should be
noted that conveying the table of encodings through RTCP forces the
application to maintain a separate mapping table for each sender as
there can be no guarantee that all senders will use the same table.
Since the control protocol proposed here is unreliable, changing the
meaning of encoding indices dynamically is fraught with possibilities
for misinterpretation and lost data unless this mapping is carried in
every packet.
global: Here, the media identifier is an index into a global table
of encodings. A global list reduces the need for out-of-band
information. Transmitting the parameters associated with an encoding
may be difficult, however, if it has to be done within the header space
constraints of per-packet signaling.
To make detecting coder mismatches easier, encodings for all media should
be drawn from the same numbering space. To facilitate experimentation with
new encodings, a part of any global encoding numbering space should be
set aside for experimental encodings, with numbers agreed upon within the
community experimenting with the encoding, with no Internet-wide guarantee
of uniqueness.
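A sketch of how a receiver might resolve encoding indices, combining a
conference-specific table distributed out-of-band with a global table; all
indices, names and the experimental cutoff shown here are hypothetical
assumptions, not assigned values.

```python
# Hypothetical global table; the indices and the experimental
# cutoff are illustrative assumptions, not assigned values.
GLOBAL_ENCODINGS = {
    0: ("PCMU", {"rate": 8000, "channels": 1}),   # G.711 mu-law
    1: ("PCMA", {"rate": 8000, "channels": 1}),   # G.711 A-law
}
EXPERIMENTAL_MIN = 96   # indices >= this are community-defined

def lookup(index, conference_table=None):
    """Resolve a media-encoding index, preferring a conference-
    specific table distributed out-of-band, then the global table."""
    if conference_table and index in conference_table:
        return conference_table[index]
    if index >= EXPERIMENTAL_MIN:
        raise KeyError("experimental encoding %d: meaning is local "
                       "to the experimenting community" % index)
    return GLOBAL_ENCODINGS[index]
```

Note that a per-sender conference table, as the text points out, forces the
application to keep one such mapping per sender.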
3.5.1 Audio Encodings
Audio data is commonly characterized by three independent descriptors:
encoding (the translation of one or more audio samples into a channel
symbol), the number of channels (mono, stereo, ...) and the sampling rate.
Theoretically, sampling rate and encoding are (largely) independent. We
could, for example, apply mu-law encoding to any sampling rate even though
it is traditionally used with a rate of 8,000 Hz. In practical terms, it
may be desirable to limit the combinations of encoding and sampling rate to
the values the encoding was designed for.(2) Channel counts between 1 and
6 should be sufficient even for surround sound.
------------------------------
2. Given the wide availability of mu-law encoding and its low overhead,
using it with a sampling rate of 16,000 or 32,000 Hz might be quite
The audio encodings listed in Table 1 appear particularly interesting,
even though the list is by no means exhaustive and does not include some
experimental encodings currently in use, for example a non-standard form of
LPC. The bit rate is shown per channel. k samples/s, b/sample and kb/s
denote kilosamples per second, bits per sample and kilobits per second,
respectively. If sampling rates are to be specified separately, the values
of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other
values (11.025 and 22.05 kHz) are supported on some workstations (the
Silicon Graphics audio hardware and the Apple Macintosh, for example).
Clearly, little is to be gained by allowing arbitrary sampling rates, as
conversion particularly between rates not related by simple fractions is
quite cumbersome and processing-intensive [10].
Org.      Name     k samples/s  b/sample    kb/s  description
CCITT     G.711        8.0         8          64  mu-law PCM
CCITT     G.711        8.0         8          64  A-law PCM
CCITT     G.721        8.0         4          32  ADPCM
Intel     DVI          8.0         4          32  ADPCM
CCITT     G.723        8.0         3          24  ADPCM
CCITT     G.726                                   ADPCM
CCITT     G.727                                   ADPCM
NIST/GSA  FS 1015      8.0                  2.4   LPC-10E
NIST/GSA  FS 1016      8.0                  4.8   CELP
NADC      IS-54        8.0                 7.95   N. American Digital Cellular, VSELP
CCITT     G.728        8.0                   16   LD-CELP
          GSM          8.0                   13   RPE-LTP
CCITT     G.722       16.0                   64   7 kHz, SB-ADPCM
ISO       11172-3                           256   MPEG audio
                      32.0        16        512   DAT
                      44.1        16      705.6   CD, DAT playback
                      48.0        16        768   DAT record
Table 1: Standardized and common audio encodings
------------------------------
appropriate for high-quality audio conferences, even though there are other
encodings, such as G.722, specifically designed for such applications. Note
that the signal-to-noise ratio of mu-law encoding is about 38 dB, equivalent
to an AM receiver. The ``telephone quality'' associated with G.711 is due
primarily to the limitation in frequency response to the 200 to 3500 Hz
range.
3.5.2 Video Encodings
Common video encodings are listed in Table 2. Encodings with tunable rate
can be configured for different rates, but produce a fixed-rate stream.
The average bit rate produced by variable-rate codecs depends on the source
material.
Org.        name      rate               remarks
CCITT       JPEG      tunable
ISO         MPEG      variable, tunable
CCITT       H.261     tunable            p x 64 kb/s
Bolter                variable, tunable
PictureTel            ??
Cornell U.  CU-SeeMe  variable
Xerox PARC  nv        variable, tunable
BBN         DVC       variable, tunable  block differences
Table 2: Common video encodings
3.6 Playout Synchronization
A major purpose of RTP is to provide the support for various forms of
synchronization, without necessarily performing the synchronization itself.
We can distinguish three kinds of synchronization:
playout synchronization: The receiver plays out the medium a fixed time
after it was generated at the source (end-to-end delay). This
end-to-end delay may vary from synchronization unit to synchronization
unit. In other words, playout synchronization assures that a constant
rate source at the sender again becomes a constant rate source at the
receiver, despite delay jitter in the network.
intra-media synchronization: All receivers play the same segment of a
medium at the same time. Intra-media synchronization may be needed
during simulations and wargaming.
inter-media synchronization: The timing relationship between several media
sources is reconstructed at the receiver. The primary example is
the synchronization between audio and video (lip-sync). Note that
different receivers may experience different delays between the media
generation time and their playout time.
Playout synchronization is required for most media, while intra-media and
inter-media synchronization may or may not be implemented. In connection
with playout synchronization, we can group packets into playout units, a
number of which in turn form a synchronization unit. More specifically, we
define:
synchronization unit: A synchronization unit consists of one or more
playout units (see below) that, as a group, share a common fixed delay
between generation and playout of each part of the group. The delay
may change at the beginning of such a synchronization unit. The most
common synchronization units are talkspurts for voice and frames for
video transmission.
playout unit: A playout unit is a group of packets sharing a common
timestamp. (Naturally, packets whose timestamps are identical due
to timestamp wrap-around are not considered part of the same playout
unit.) For voice, the playout unit would typically be a single voice
segment, while for video a video frame could be broken down into
subframes, each consisting of packets sharing the same timestamp and
ordered by some form of sequence number.
Two concepts related to synchronization and playout units are absolute and
relative timing. Absolute timing maintains a fixed timing relationship
between sender and receiver, while relative timing ensures that the spacing
between packets at the sender is the same as that at the receiver, measured
in terms of the sampling clock. Playout units within the synchronization
unit maintain relative timing with respect to each other; absolute timing is
undesirable if the receiver clock runs at a (slightly) different rate than
the sender clock.
Most proposed synchronization methods require a timestamp. The timestamp
has to have a sufficient range that wrap-arounds are infrequent. It
is desirable that the range exceeds the maximum expected inactive (e.g.,
silence) period. Otherwise, if the silence period lasts a full timestamp
range, the first packet of the next talkspurt would have a timestamp one
larger than the last packet of the current talkspurt. In that case, the
new talkspurt could not be readily discerned if the difference in increment
between timestamps and sequence numbers is used to detect a new talkspurt.
The 10-bit timestamp used by NVP is generally agreed to be too small as it
wraps around after only 20.5 s (for 20 ms audio packets), while a 32-bit
timestamp should serve all anticipated needs, even if the timestamp is
expressed in units of samples or other sub-packet entities.
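With a 32-bit timestamp, wraparounds can be made harmless by comparing
timestamps with modulo arithmetic, as in this sketch; the function names are
illustrative.

```python
MOD = 1 << 32   # 32-bit timestamp space

def ts_later(a, b):
    """True if 32-bit timestamp a is 'after' b: the circular
    distance (a - b) mod 2**32 falls in the forward half-space."""
    return a != b and (a - b) % MOD < MOD // 2

def ts_diff(a, b):
    """Signed difference a - b under 32-bit wraparound."""
    d = (a - b) % MOD
    return d - MOD if d >= MOD // 2 else d
```

As long as wraparounds occur 'naturally' at the word boundary, no explicit
overflow checks are needed beyond the modulo reduction shown here.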
A timestamp may be useful not only at the transport, but also at the network
layer, for example, for scheduling packets based on urgency. The playout
timestamp would be appropriate for such a scheduling timestamp, as it would
better reflect urgency than a network-level departure timestamp. Thus, it
may make sense to use a network-level timestamp such as the one provided by
ST-II at the transport layer.
3.6.1 Synchronization Methods
The necessary header components are determined to some extent by the method
of synchronizing sender and receivers. In this section, we formally
describe some of the popular approaches, building on the exposition and
terminology of Montgomery [11].
We define a number of variables describing the synchronization process. In
general, the subscript n represents the nth packet in a synchronization
unit, n = 1, 2, .... Let a_n, d_n, p_n and t_n be the arrival time, variable
delay, playout time and generation time of the nth packet, respectively.
Let o denote the fixed delay from sender to receiver. Finally, d_max
describes the estimated maximum variable delay within the network. The
estimate is typically chosen in such a way that only a very small fraction
(on the order of 1%) of packets take more than o + d_max time units. For best
performance under changing network load conditions, the estimate should be
refined based on the actual delays experienced. The variable delay in a
network consists of queueing and media access delays, while propagation and
processing delays make up the fixed delay. Additional end-to-end fixed
delay is unavoidably introduced by packetization; the non-real-time nature
of most operating systems adds a variable delay both at the transmitting and
receiving end. All variables are expressed in the same unit of time, be
that seconds or samples. For simplicity, we ignore that the
sender and receiver clocks may not run at exactly the same speed. The
relationship between the variables is depicted in Fig. 2. The arrows in the
figure indicate the transmission of the packet across the network, occurring
after the packetization delay. The packet with sequence number 5 misses the
playout deadline and, depending on the algorithm used by the receiver, is
either dropped or treated as the beginning of a new talkspurt.
Figure only available in PostScript version of document.
Figure 2: Playout Synchronization Variables
Given the above definitions, the relationship

     a_n = t_n + d_n + o                                            (1)

holds for every packet. For brevity, we also define l_n as the ``laxity''
of packet n, i.e., the time p_n - a_n between arrival and playout. Note that
it may be difficult to measure a_n with resolution below a packetization
interval, particularly if the measurement is to be in units related to the
playback process (e.g., samples). All synchronization methods differ only
in how much they delay the first packet of a synchronization unit. All
packets within a synchronization unit are played out based on the position
of the first packet:

     p_n = p_(n-1) + (t_n - t_(n-1))   for n > 1
Three synchronization methods are of interest. We describe below how they
compute the playout time for the first packet in a synchronization unit and
what measurement is used to update the delay estimate d_max.
blind delay: This method assumes that the first packet in a talkspurt
     experiences only the fixed delay, so that the full d_max has to be
     added to allow for other packets within the talkspurt experiencing more
     delay:

          p_1 = a_1 + d_max                                         (2)

     The estimate for the variable delay is derived from measurements
     of the laxity l_n, so that the new estimate after n packets is
     computed as d_max,n = f(l_1, ..., l_n), where the function f(.) is a
     suitably chosen smoothing function. Note that blind delay does not
     require timestamps to determine p_1, only an indication of the
     beginning of a synchronization unit. Timestamps may be required to
     compute p_n, however, unless t_n - t_(n-1) is a known constant.
absolute timing: If the packet carries a timestamp measured in time units
     known to the receiver, we can improve our determination of the playout
     point:

          p_1 = t_1 + o + d_max

     This is, clearly, the best that can be accomplished. Here, instead of
     estimating d_max, we estimate o + d_max as some function of p_n - t_n.
     For this computation, it does not matter whether p and t are measured
     with clocks sharing a common starting point.
added variable delay: Each node adds the variable delay experienced within
     it to a delay accumulator within the packet, yielding d_n:

          p_1 = a_1 - d_1 + d_max

     From Eq. 1, it is readily apparent that absolute timing and added
     variable delay yield the same playout time. The estimate for d_max is
     based on the measurements for d. Given a clock with suitably high
     resolution, these estimates can be better than those based on the
     difference between a and p; however, this method requires that all
     routers can recognize RTP packets. Also, determining the residence
     time within a router may not be feasible.
In summary, absolute timing is to be preferred due to its lower delays
compared to blind delay, while synchronization using added variable delays
is currently not feasible within the Internet (it is, however, used for
G.764).
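The playout computations above can be sketched as follows; the list-based
interface is an illustrative assumption, and added variable delay is omitted
since it yields the same playout times as absolute timing.

```python
def playout_times(packets, d_max, o=None):
    """Compute playout times for one synchronization unit.
    Each packet is a pair (arrival a_n, generation timestamp t_n),
    in generation order.
    Blind delay:               p_1 = a_1 + d_max
    Absolute timing (needs o): p_1 = t_1 + o + d_max
    Later packets keep relative timing: p_n = p_1 + (t_n - t_1)."""
    a1, t1 = packets[0]
    p1 = (t1 + o + d_max) if o is not None else (a1 + d_max)
    return [p1 + (tn - t1) for (_, tn) in packets]
```

The numbers of Table 3 can be reproduced this way: with d_max = 50 for blind
timing and o + d_max = 550 for absolute timing, both methods yield playout
times 1770, 1790, 1810 for the second talkspurt when packets arrive in order.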
3.6.2 Detection of Synchronization Units
The receiver must have a way of readily detecting the beginning of a
synchronization unit, as the playout scheduling of the first packet in a
synchronization unit differs from that in the remainder of the unit. This
detection has to work reliably even with packet reordering; for example,
reordering at the beginning of a talkspurt is particularly likely since
common silence detection algorithms send a group of stored packets at the
beginning of the talkspurt to prevent front clipping.
Two basic methods have been proposed:
timestamp and sequence number: The sequence number increases by one with
each packet transmitted, while the timestamp reflects the total time
covered, measured in some appropriate unit. A packet is declared to
start a new synchronization unit if (a) it has the highest timestamp
and sequence number seen so far (within this wraparound cycle) and
(b) the difference in timestamp values (converted into a packet count)
between this and the previous packet is greater than the difference in
sequence number between those two packets.
This approach has the disadvantage that it may lead to erroneous packet
scheduling with blind delay if packets are reordered. An example is
shown in Table 3. In the example, the playout delay is set at 50 time
units for blind timing and 550 time units for absolute timing. The
packet intergeneration time is 20 time units.
                        blind timing                 absolute timing
                no reordering    with reordering
seq. timestamp  arrival playout  arrival playout     arrival playout
200    1020      1520    1570     1520    1570        1520    1570
201    1040      1530    1590     1530    1590        1530    1590
202    1220      1720    1770     1725    1750        1725    1770
203    1240      1725    1790     1720    1770        1720    1790
204    1260      1792    1810     1791    1790        1791    1810
Table 3: Example where out-of-order arrival leads to packet loss for blind
timing
More significantly, detecting synchronization units requires that the
playout mechanism can translate timestamp differences into packet
counts, so that it can compare timestamp and sequence number
differences. If the timespan ``covered'' by a packet changes with
the encoding or even varies for each packet, this may be cumbersome.
NVP provides the timestamp/sequence number combination for detecting
talkspurts. The following method avoids these drawbacks, at the cost
of one additional header bit.
synchronization bit: The beginning of a synchronization unit is indicated
by setting a synchronization bit within the header. The receiver,
however, can only use this information if no later packet has already
been processed. Thus, packet reordering at the beginning of a
talkspurt leads to missing opportunities for delay adjustment. With
the synchronization bit, a sequence number is not necessary to detect
the beginning of a synchronization unit, but a sequence number remains
useful for detecting packet loss and ordering packets bearing the same
timestamp. With just a timestamp, it is impossible for the receiver
to get an accurate count of the number of packets that it should have
received. While gaps within a talkspurt give some indication of packet
loss, the receiver cannot tell what part of the tail of a talkspurt
has been transmitted. (Example: consider the talkspurts with time
stamps 100, 101, 102, 110, 111. Packets with timestamp 100 and 110
have the synchronization bit set. The receiver has no way of knowing
whether it was supposed to have received two talkspurts with a total of
five packets, or two or more talkspurts with up to 12 packets.) The
synchronization bit is used by vat, without a sequence number. It is
also contained in the original version of NVP [12]. A special sequence
number, as used by G.764, is equivalent.
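The timestamp-plus-sequence-number rule can be sketched as follows, assuming
a fixed packetization interval so that timestamp differences can be converted
into packet counts; wraparound handling is omitted for brevity.

```python
def starts_new_unit(seq, ts, prev_seq, prev_ts, ts_per_packet):
    """Rule (a): the packet must carry the highest sequence number
    and timestamp seen so far; rule (b): the timestamp gap,
    converted to a packet count, must exceed the sequence gap."""
    if seq <= prev_seq or ts <= prev_ts:      # not the newest packet
        return False
    return (ts - prev_ts) // ts_per_packet > (seq - prev_seq)
```

With the Table 3 values (intergeneration time 20), packet 202 is correctly
flagged as a talkspurt start, while a reordered old packet is not.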
3.6.3 Interpretation of Synchronization Bit
Two possibilities for implementing a synchronization bit are discussed here.
start of synchronization unit: The first packet in a synchronization unit
is marked with a set synchronization bit. With this use of
the synchronization bit, the receiver detects the beginning of a
synchronization unit with the following simple algorithm:
     if synchronization bit = 1
        and current sequence number > maximum sequence number seen so far
     then
        this packet starts a new synchronization unit
     endif
     if current sequence number > maximum sequence number
     then
        maximum sequence number := current sequence number
     endif
Comparisons and arithmetic operations are modulo the sequence number
range.
end of synchronization unit: The last packet in a synchronization unit is
marked. As pointed out elsewhere, this information may be useful
for initiating appropriate fill-in during silence periods and to start
processing a completed video frame. If a voice silence detector uses
no hangover, it may have difficulty deciding which is the last packet
in a talkspurt until it judges the first packet to contain no speech.
The detection of a new synchronization unit by the receiver is only
slightly more complicated than with the previous method:
     if sync_flag then
        if sequence number >= sync_seq then
           sync_flag := FALSE
        endif
        if sequence number = sync_seq then
           signal beginning of synchronization unit
        endif
     endif
     if synchronization bit = 1 then
        sync_seq := sequence number + 1
        sync_flag := TRUE
     endif
By changing the equal sign in the second comparison to 'if sequence
number >= sync_seq', a new synchronization unit is detected even if
packets at the beginning of the synchronization unit are reordered. As
reordering at the beginning of a synchronization unit is particularly
likely, for example when transmitting the packets preceding the
beginning of a talkspurt, this should significantly reduce the number
of missed talkspurt beginnings.
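A sketch of the end-marker variant, using the '>=' comparison suggested above
so that reordered packets at the start of a unit are still detected; the
class interface is illustrative and sequence-number wraparound is ignored for
brevity.

```python
class EndMarkerDetector:
    """Detect synchronization-unit starts when the LAST packet of
    each unit carries the synchronization bit."""

    def __init__(self):
        self.sync_flag = False
        self.sync_seq = None

    def on_packet(self, seq, sync_bit):
        """Return True if this packet begins a new unit."""
        starts = False
        if self.sync_flag and seq >= self.sync_seq:
            self.sync_flag = False
            starts = True        # '>=' variant tolerates reordering
        if sync_bit:             # last packet of the current unit
            self.sync_seq = seq + 1
            self.sync_flag = True
        return starts
```

In the usage below, the first packet after the marked one (seq 102) is lost
or late and seq 103 arrives first; the '>=' comparison still signals the new
unit.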
3.6.4 Interpretation of Timestamp
Several proposals as to the interpretation of the timestamp have been
advanced:
packet or frame interval: Each packetization or (video/audio) frame
     interval increments the timestamp. This approach is very efficient in
     terms of processing and bit-use, but cannot be used without out-of-band
information if the time interval of media ``covered'' by a packet
varies from packet to packet. This occurs for example with
variable-rate encoders or if the packetization interval is changed
during a conference. This interpretation of a timestamp is assumed by
NVP, which defines a frame as a block of PCM samples or a single LPC
frame. Note that there is no inherent necessity that all participants
within a conference use the same packetization interval. Local
implementation considerations such as available clocks may suggest
different intervals. As another example, consider a conference with
feedback. For the lecture audio, a long packetization interval may
be desirable to better amortize packet headers. For side chats,
delays are more important, thus suggesting a shorter packetization
interval.(3)
------------------------------
3. Nevot, for example, allows each participant to have a different
packetization interval, independent of the packetization interval used by
sample: This method simply counts samples, allowing a direct translation
between time stamp and playout buffer insertion point. It is just
as easily computable as the per-packet timestamp. However, for some
media and encodings,(4) it may not be quite clear what a sample is.
Also, some care must be taken at the receiver and sender if streams use
different sampling rates. This method is currently used by vat.
Milliseconds: A timestamp incremented every millisecond would wrap around
once every 49 days. The resolution is sufficient for most
applications, except that the natural packetization interval for
LPC-coded speech is 22.5 ms. Also, with a video frame rate of 30 Hz,
an internal timestamp of higher resolution would need to be truncated
to millisecond resolution to approximate 33.3 ms intervals. This time
increment has the advantage of being used by some Unix delay functions,
which might be useful for playing back video frames with proper timing.
It might be useful to take the seconds value from the current system
clock to allow delay estimates for synchronized clocks.
subset of NTP timestamp: 16 bits encode seconds relative to midnight (0
hours), January 1, 1900 (modulo 65536) and 16 bits encode fractions of
a second, with a resolution of approximately 15.2 microseconds, which
is smaller than any anticipated audio sampling or video frame interval.
This timestamp is the same as the middle 32 bits of the 64-bit NTP
timestamp [13]. It wraps around every 18.2 hours. If it should be
desirable to reconstruct absolute transmission time at the receiver for
logging or recording purposes, it should be easy to determine the most
significant 16 bits of the timestamp. Otherwise, wrap-arounds are not
a significant problem as long as they occur 'naturally', i.e., at a 16
or 32 bit boundary, so that explicit checking on arithmetic operations
is not required. Also, since the translation mechanism would probably
treat the timestamp as a single integer without accounting for its
division into whole and fractional part, the exact bit allocation
between seconds and fractions thereof is less important. However,
the 16/16 approach simplifies extraction from a full NTP timestamp.
Sixteen bits of fractional seconds also allows a timestamp without
wrap-around, i.e., with 32 bits of full seconds encoding time since
January 1, 1900, to fit into the 52 mantissa bits of an IEEE
double-precision floating point number.
The NTP-like timestamp has the disadvantage that its resolution does
not map into any of the common sample or packetization intervals.
Thus, there is a potential uncertainty of one sample at the receiver
------------------------------
Nevot for its outgoing audio. The only constraint is that the packetization
interval for outgoing audio be the same across all conferences this Nevot
participates in.
4. Examples include frame-based encodings such as LPC and CELP. Here, given
that these encodings are based on 8,000 Hz input samples, the preferred
interpretation would probably be in terms of audio samples, not frames, as
samples would be used for reconstruction and mixing.
H. Schulzrinne Expires 03/01/94 [Page 26]
INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
as to where to place the beginning of the received packet, resulting
in the equivalent of a one-sample slip. CCITT recommendation G.821
postulates a mean slip rate of less than 1 slip in 5 hours, with
degraded but acceptable service for less than 1 slip in 2 minutes.
Tests with appropriate rounding conducted by the author showed that
this uncertainty is not likely to cause problems. In any event, a
double-precision floating point multiplication is needed to translate
between this timestamp and the integer sample count available on
transmission and required for playout.(5)
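As a concrete illustration (a sketch, not part of any protocol specification),
the 16/16 timestamp is simply the middle 32 bits of a 64-bit NTP timestamp,
and the resolution and wrap-around figures quoted above follow directly:

```python
def middle_32(ntp64):
    """Bits 16..47 of a 64-bit NTP timestamp: 16 bits of seconds
    (modulo 65536) and the 16 most significant fraction bits."""
    return (ntp64 >> 16) & 0xFFFFFFFF

# Resolution of 16 fraction bits: 1/65536 s, about 15.26 microseconds.
resolution_us = 1e6 / 65536

# The 16-bit seconds field wraps every 65536 s, about 18.2 hours.
wrap_hours = 65536 / 3600.0

ntp = (0x12345678 << 32) | 0x9ABCDEF0   # an arbitrary 64-bit NTP value
assert middle_32(ntp) == 0x56789ABC
```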
MPEG timestamps: MPEG uses a 33 bit clock with a resolution of 90 kHz [14]
as the system clock reference and for presentation time stamps. The
frequency was chosen based on the divisibility by the nominal video
picture rates of 24 Hz, 25 Hz, 29.97 Hz and 30 Hz [14, p.42]. The
frequency would also fit nicely with the 20 ms audio packetization
interval. The length of 33 bits is clearly inappropriate, however, for
software implementations. 32-bit timestamps still cover more than half
a day and thus can be readily extended to full unique timestamps or 33
bits if needed.
Microseconds: A 32-bit timestamp incremented every microsecond wraps around
once every 71.5 minutes. The resolution is high enough that round-off
errors for video frame intervals and such should be tolerable without
maintaining a higher-precision internal counter. This resolution is
also provided, at least nominally, by the Unix gettimeofday() system
call.
QuickTime: The Apple QuickTime file format is a generalization of the
previous formats as it combines a 32-bit counter with a 32-bit media
time scale expressed in time units per second. The four previously
mentioned timestamps can be represented by time scales of 1000, 65536,
90,000 and 1,000,000. For the sample and packet-based case, the value
would depend on the media content, e.g., 8,000 for standard PCM-coded
audio.
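The QuickTime idea can be sketched as follows (illustrative code, not drawn
from the QuickTime specification): a media time is a (counter, time scale)
pair, and conversion between the time scales listed above is a rounded
multiply and divide.

```python
def rescale(counter, from_scale, to_scale):
    """Convert a media-time counter between time scales (units per
    second), rounding to the nearest unit."""
    return (counter * to_scale + from_scale // 2) // from_scale

# 500 ms expressed in each of the larger time scales mentioned above:
assert rescale(500, 1000, 65536) == 32768     # NTP-style 16/16 fraction
assert rescale(500, 1000, 90000) == 45000     # MPEG 90 kHz ticks
assert rescale(500, 1000, 1000000) == 500000  # microseconds
```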
Timestamps based on wallclock time rather than samples or frames have the
advantage that a receiver does not necessarily need to know about the
meaning of the encoding contained in the packet in order to process the
timestamp. For example, a quality-of-service monitor within the network
could measure delay variance easily, without caring what kind of audio
information, say, is contained in the packet. Other tools, such as a
recording and playback tool, can also be written without concern about the
mapping between timestamp and wallclock units.
------------------------------
5. The multiplication with an appropriate factor can be approximated
to the desired precision by an integer multiplication and division, but
multiplication by a floating point value is actually much faster on some
modern processors.
A time stamp could reflect either real time or sample time. A real time
timestamp is defined to track wallclock time plus or minus a constant
offset. Sample time increases by the nominal sampling interval for each
sample. The two clocks in general do not agree since the clock source used
for sampling will in all likelihood be slightly off the nominal rate. For
example, typical crystals without temperature control are only accurate to
50 -- 100 ppm (parts per million), yielding a potential drift of 0.36
seconds per hour between the sampling clock and wallclock time.
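The drift figure quoted above is a one-line computation, with the crystal
tolerance as the only input:

```python
def drift_seconds_per_hour(ppm):
    """Worst-case clock drift accumulated over one hour for a given
    oscillator tolerance in parts per million."""
    return ppm * 1e-6 * 3600.0

# 100 ppm yields the 0.36 s/hour figure cited above; 50 ppm gives half.
assert abs(drift_seconds_per_hour(100) - 0.36) < 1e-9
assert abs(drift_seconds_per_hour(50) - 0.18) < 1e-9
```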
It has been suggested to use timestamps relative to the beginning of the
first transmission from a source. This makes correlation between media
from different participants difficult and seems to have no technical or
implementation advantages, except for avoiding wrap-around during most
conferences. As pointed out above, that seems to be of little benefit.
Clearly, the reliability of wallclock-synchronized timestamps depends on
how closely the system clocks are synchronized, but that does not argue for
giving up potential real-time synchronization in all cases.
Using real time rather than sample time allows for easier synchronization
between different media and users (e.g., during playback of a recorded
conference) and compensates for slow or fast sample clocks. Note that it
is neither desirable nor necessary to obtain the wall clock time when each
packet was sampled. Rather, the sender determines the wallclock time at the
beginning of each synchronization unit (e.g., a talkspurt for voice and a
frame for video) and adds the nominal sample clock duration for all packets
within the talkspurt to arrive at the timestamp value carried in packets.
The real time at the beginning of a talkspurt is determined by estimating
the true sample rate for the duration of the conference.
The sample rate estimate has to be accurate enough to allow placing the
beginning of a talkspurt, say, to within at most 50 to 100 ms, otherwise the
lack of synchronization may be noticeable, delay computations are confused
and successive talkspurts may be concatenated.
Estimating the true sampling instant to within a few milliseconds is
surprisingly difficult for current operating systems. The sample rate r can
be estimated as

                       r = (s + q) / (t - t0).

Here, t is the current time, t0 the time at which the first sample was
acquired, s is the number of samples read, and q is the number of samples
ready to be read (queued) at time t. Let p denote the number of samples
in a packet. The timestamp in the synchronization packet reflects the
sampling instant of the first sample of that packet and is computed as
t - (p + q)/r. Unfortunately, only s and p are known precisely. The
accuracy of the estimates for t and t0 depends on how accurately the
beginning of sampling and the last reading from the audio device can be
measured. There
is a non-zero probability that the process will get preempted between the
time the audio data is read and the instant the system clock is sampled.
It remains unclear whether indications of current buffer occupancy, if
available, can be trusted. Even with increasing sample count, the absolute
accuracy of the timestamp is roughly the same as the measurement accuracy of
t, as differentiating with respect to t shows. Experiments with the SunOS
audio driver showed significant variations of the estimated sample rate,
with discontinuities of the computed timestamps of up to 25 ms. Kernel
support is probably required for meaningful real time measurements.
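The estimation procedure above can be sketched as follows (the numbers are
illustrative, not measurements):

```python
def estimate_rate(s, q, t, t0):
    """Estimated true sample rate r = (s + q) / (t - t0)."""
    return (s + q) / (t - t0)

def packet_timestamp(t, p, q, r):
    """Sampling instant of the first sample of a p-sample packet,
    read at wallclock time t with q samples still queued."""
    return t - (p + q) / r

# A nominally 8000 Hz clock that ran 0.1% fast over 10 s of sampling:
r = estimate_rate(s=80040, q=40, t=10.0, t0=0.0)
assert r == 8008.0
ts = packet_timestamp(t=10.0, p=160, q=40, r=r)   # a 20 ms packet
assert abs(ts - (10.0 - 200 / 8008.0)) < 1e-12
```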
Sample time increments with the sampling interval for every sample or
(sub)frame received from the audio or video hardware. It is easy to
determine, as long as care is taken to avoid cumulative round-off errors
incurred by simply repeatedly adding the approximate packetization interval.
However, synchronization between media and end-to-end delay measurements are
then no longer feasible. (Example: Consider an audio and a video stream.
If the audio sample clock is slightly faster than the real clock and the
video sampling clock, a video and audio frame belonging together would be
marked by different timestamps, thus played out at different instants.)
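The round-off caveat can be made concrete (hypothetical values): deriving
each timestamp from the frame index avoids the error that accumulates when a
truncated microsecond approximation of the 30000/1001 Hz frame interval is
added repeatedly.

```python
FRAME_US = 33366   # truncated approximation of 1001/30000 s in microseconds

def ts_by_addition(n, inc=FRAME_US):
    """Timestamp of frame n by repeated addition (error accumulates)."""
    t = 0
    for _ in range(n):
        t += inc
    return t

def ts_by_index(n):
    """Timestamp of frame n computed from the index: n * 1001/30000 s."""
    return n * 1001 * 1000000 // 30000

# After 100,000 frames the repeated addition has drifted by about 67 ms:
assert ts_by_index(100000) - ts_by_addition(100000) == 66666
```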
If we choose to use sample time, the advantage of using an NTP-format
timestamp disappears, as the receiver can easily reconstruct an NTP
sample-based timestamp from the sample count if needed, but would not have
to if no cross-media synchronization is required. RTCP could relate the
time increment per sample in full precision. The definition of a ``sample''
will depend on the particular medium, and could be an audio sample, a video
frame or a voice frame (as produced by a non-waveform coder). The mapping fails
if there is no time-invariant mapping between sample units and time.
It should be noted that it may not be possible to associate a meaningful
notion of time with every packet. For example, if a video frame is
broken into several fragments, there is no natural timestamp associated
with anything but the first fragment, particularly if there is not even
a sequential mapping from screen scan location into packets. Thus, any
timestamp used would be purely artificial. A synchronization bit could be
used in this particular case to mark beginning of synchronization units.
For packets within synchronization units, there are two possible approaches:
first, we can introduce an auxiliary sequence number that is only used to
order packets within a frame. Secondly, we could abuse the timestamp field
by incrementing it by a single unit for each packet within the frame, thus
allowing a variable number of packets per frame. The latter approach is
barely workable and rather kludgy.
3.6.5 End-of-talkspurt indication
An end-of-talkspurt indication is useful to distinguish silence from lost
packets. The receiver would want to replace silence by an appropriate
background noise level to avoid the ``noise-pumping'' associated with
silence detection. On the other hand, missing packets should be
reconstructed from previous packets. If the silence detector makes use
of hangover, the transmitter can easily set the end-of-talkspurt indicator
on the last bit of the last hangover packet. If the talkspurts follow
end-to-end, the end-of-talkspurt indicator has no effect except in the
case where the first packet of a talkspurt is lost. In that case, the
indicator would erroneously trigger noise fill instead of loss recovery.
The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit
which is set to one for all but the last packet within a talkspurt.
3.6.6 Recommendation
Given the ease of cross-media synchronization and the media independence,
the use of 32-bit 16/16 timestamps representing the middle part of the NTP
timestamp is suggested. Generally, a wallclock-based timestamp appears
to be preferable to a sample-based one, but it may only be approximately
realizable for some current operating systems. Inter-media synchronization
to below 10 to 20 ms has to await mechanisms that can accurately determine
when a particular sample was actually received by the A/D converter.
Particularly with sample- or wallclock-based timestamps, a synchronization
bit simplifies the detection of the beginning of a synchronization unit.
Indicating either the end or beginning of a synchronization unit is roughly
equivalent, with tradeoffs between the two.
3.7 Segmentation and Reassembly
For high-bandwidth video, a single frame may not fit into the maximum
transmission unit (MTU). Thus, some form of frame sequence number is needed.
If possible, the same sequence number should be used for synchronization and
fragmentation. Six possibilities suggest themselves:
overload the timestamp: No sequence number is used. Within a frame, the
timestamp has no meaning. Since it is used for synchronization only
when the synchronization bit is set, the other timestamps can just
increase by one for each packet. However, as soon as the first
frame gets lost or reordered, determining positions and timing becomes
difficult or impossible.
packet count: The sequence number is incremented for every packet, without
regard to frame boundaries. If a frame consists of a variable number
of packets, it may not be clear what position the packet occupies
within the frame if packets are lost or reordered. Continuous sequence
numbers make it possible to determine if all packets for a particular
frame have arrived, but only after the first packet of the next frame,
distinguished by a new timestamp, has arrived.
packet count within a frame: The sequence number is reset to zero at the
beginning of each frame. This approach has properties complementary to
continuous sequence numbers.
packet count and first-packet sequence number: Packets use a continuously
incrementing sequence number plus an option field in every packet
indicating the initial sequence number within the playout unit.(6)
Carrying both a continuous and packet-within-frame count achieves the
same effect.
packet count with last-packet sequence number: Packets carry a continuous
sequence number plus an option in every packet indicating the last
sequence number within the playout unit. This has the advantage that
the receiver can readily detect when the last packet for a playout unit
has been received. The transmitter may not know, however, at the
beginning of a playout unit how many packets it will comprise. Also,
the position within the playout unit is more difficult to determine if
the initial packet and the previous frame is lost.
packet count and frame count: The sequence number counts packets, without
regard to frame boundaries. A separate counter increments with each
frame. Detecting the end of a frame is delayed until the first packet
belonging to the next frame. Also, the frame count cannot help to
determine the position of the packet within a frame.
It could be argued that encoding-specific location information should be
contained within the media part, as it will likely vary in format and use
from one medium to the next. Thus, the frame count, the sequence number of
the last or first packet in a frame, etc., belong in the media-specific
header.
The size of the sequence number field should be large enough to allow
unambiguous counting of expected vs. received packets. A 16-bit sequence
number would wrap around roughly every 22 minutes for a 20 ms packetization
interval. Using 16 bits may also simplify modulo arithmetic.
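The modulo arithmetic that a 16-bit field simplifies can be sketched as
follows (a common idiom, not something prescribed by this draft):

```python
def seq_diff(a, b):
    """Signed distance from sequence number b to a, modulo 2**16,
    in the range -32768..32767; correct across wrap-around as long
    as the true distance is less than 32768 packets."""
    return ((a - b + 0x8000) & 0xFFFF) - 0x8000

assert seq_diff(5, 0xFFFE) == 7    # 'a' is 7 packets ahead, across the wrap
assert seq_diff(100, 107) == -7    # 'a' lags by 7 packets
```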
3.8 Source Identification
3.8.1 Bridges, Translators and End Systems
It is necessary to be able to identify the origin of the real-time data in
terms meaningful to the application. First, this is required to demultiplex
sites (or sources) within the same conference. Secondly, it allows an
indication of the currently active source.
Currently, NVP makes no explicit provisions for this, assuming that the
network source address can be used. This may fail if intermediate agents
intervene between the content source and final destination. Consider the
example in Fig. 3. An RTP-level bridge is defined as an entity that
------------------------------
6. suggested by Steve Casner
transforms either the RTP header or the RTP media data or both. Such
a bridge could for example merge two successive packets for increased
transport efficiency or, probably the most common case, translate media
encodings for each stream, say from PCM to LPC (called transcoding).
A synchronizing bridge is defined here as a bridge that recreates a
synchronous media stream, possibly after mixing several sources. An
application that mixes all incoming streams for a particular conference,
recreates a synchronous audio stream and then forwards it to a set of
receivers is an example of a synchronizing bridge. A synchronizing bridge
could be built from two end system applications, with the first application
feeding the media output to the media input of the second application and
vice versa.
In figure 3, the bridges are used to translate audio encodings, from PCM
and ADPCM to LPC. The bridge could be either synchronizing or not. Note
that a resynchronizing bridge is only necessary if audio packets depend on
their predecessors and thus cannot be transcoded independently. It may be
advantageous if the packetization interval can be increased. Also, for low
speed links that are barely able to handle one active source at a time,
mixing at the bridge avoids excessive queueing delays when several sources
are active at the same time. A synchronizing bridge has the disadvantage
that it always increases the end-to-end delay.
We define translators as transport-level entities that translate between
transport protocols, but leave the RTP protocol unit untouched. In the
figure, the translator connects a multicast group to a group of hosts that
are not multicast capable by performing transport-level replication.
We define an end system as an entity that receives and generates media
content, but does not forward it.
We define three types of sources: the content source is the actual origin
of the media, e.g., the talker in an audiocast; a synchronization source
is the combination of several content sources with its own timing; the
network source is the network-level origin as seen by the end system
receiving the
media.
The end system has to synchronize its playout with the synchronization
source, indicate the active party according to the content source and return
media to the network source. If an end system receives media through a
resynchronizing bridge, the end system will see the bridge as the network
and synchronization source, but the content sources should not be affected.
The translator does not affect the media or synchronization sources, but the
translator becomes the network source. (Note that having the translator
change the IP source address is not possible since the end systems need
to be able to return their media to the translator.) In the (common)
case where no bridge or translator intercepts packets between sender and
receiver, content, synchronization and network source are identical. If
there are several bridges or translators between sender and receiver, only
the last one is visible to the receiver.
          /-------\            +------+
          |       |   ADPCM    |      |
          | group |<---------->|  GW  |--\  LPC
          |       |            |      |   \       /------ end system
          \-------/            +------+    \|/
                     reflector              >------- end system
          /-------\            +------+    /|\
          |       |   PCM      |      |   /       \------ end system
          | group |<---------->|  GW  |--/  LPC
          |       |            |      |
          \-------/            +------+

              <---> multicast

                    Figure 3: Bridge topology
vat audio packets include a variable-length list of at most 64 4-byte
identifiers containing all content sources of the packet. However, there is
no convenient way to distinguish the synchronization source from the network
source. The end system needs to be able to distinguish synchronization
sources because jitter computation and playout delay differ for each
synchronization source.
3.8.2 Address Format Issues
The limitation to four bytes of addressing information may not be desirable
for a number of reasons. Currently, it is used to hold an IP address. This
works as long as four bytes are sufficient to hold an identifier that is
unique throughout the conference and as long as there is only one media
source per IP address. The latter assumption tends to be true for many
current workstations, but it is easy to imagine scenarios where it might not
be, e.g., a system could hold a number of audio cards, could have several
audio channels (Silicon Graphics systems, for example) or could serve as a
multi-line telephone interface.(7)
The combination of IP address and source port can identify multiple sources
per site if each content source uses a different source port. For a small
number of sources, it appears feasible, if inelegant, to allocate ports just
to distinguish sources. In the PBX example a single output port would
appear to be the appropriate method for sending all incoming calls across
the network. The mechanisms for allocating unique file names could also be
used. The difficult part will be to convince all applications to draw from
------------------------------
7. If we are willing to forego the identification with a site, we could
have a multiple-audio-channel site pick unused IP addresses from the local
network and associate them with the second and following audio ports.
the same numbering space.
For efficiency in the common case of one source per workstation, the
convention (used in vat) of using the network source address, possibly
combined with the user id or source port, as media and synchronization
source should be maintained.
There are several possible approaches to naming sources. We compare here
two examples representing naming through globally unique network addresses
and through a concatenation of locally unique identifiers.
The receiver needs to be able to uniquely identify the content source so
that speaker indication and labeling work. For playout synchronization, the
synchronization source needs to be determined. The identification mechanism
has to continue to work even if the path between sender and receiver
contains multiple bridges and translators.
Also, in the common case of no bridges or translators, the only information
available at the receiver is the network address and source port. This
can cause difficulties if there is more than one participant per host in a
certain conference. If this can occur, it is necessary that the application
opens two sockets, one for listening bound to the conference port number and
one for sending, bound to some locally unique port. That randomly chosen
port should also be used for reverse application data, i.e., requests from
the receiver back to the content source. Only the listening socket needs
to be a member of the IP multicast group. If an application multiplexes
several locally generated sources, e.g., an interface to an audio bridge,
it should follow the rules for bridges, that is, insert content source
information.
3.8.3 Globally unique identifiers
Sources are identified by their network address and the source port number.
The source port number rather than some other integer has to be chosen for
the common case that RTP packets contain no SSRC or CSRC options. Since
the SDES option contains an address, it has to be the network address
plus source port, no other information being available to the receiver
for matching. (The SDES address is not strictly needed unless a bridge
with mixing is involved, but carrying it keeps the receiver from having
to distinguish those cases.) Since tying a protocol too closely to one
particular network protocol is considered a bad idea (witness the difficulty
of adopting parts of FTP for non-IP protocols), the address should probably
have the form of a type-length-value field. To avoid having to manage yet
another name space, it appears possible to re-use the Ethertype values, as
all commonly used protocols with their own address space appear to have been
assigned such a value. Other alternatives, such as using the BSD Unix
AF constants suffer from the drawback that there does not appear to be a
universally agreed-upon numbering. NSAPs can contain other addresses, but
not every address format (such as IP) has an NSAP representation. The
receiver application does not need to interpret the addresses themselves; it
treats address format identifier (e.g., the Ethertype field) and address as
a globally unique byte string. We have to ensure that a single host does not
use two network addresses, one for transmission and a different one in the
SDES option.
The rules for adding CSRC and SSRC options are simple:
end system: End systems do not insert CSRC or SSRC options. The receiver
remembers the CSRC address for each site; if none is explicitly
specified, the SSRC address is used. If that is also missing, the
network address is used. SDES options are matched to this content
source address.
bridge: A bridge adds the network source address of all sources
contributing to a particular outgoing packet as CSRC options. A bridge
that receives a packet containing CSRC options may decide to copy those
CSRC options into an outgoing packet that contains data from that
bridge.
translator: The translator checks whether the packet already contains a
SSRC (inserted by an earlier translator). If so, no action is
required. Otherwise, the translator inserts an SSRC containing the
network address of the host from which the packet was received.
The SSRC option is set only by the translator, unless the packet already
bears such an option.
Globally unique identifiers based on network addresses have the advantage
that they simplify debugging, for example, allowing one to determine which
bridge processed a message, even after the packet has passed through a
translator.
3.8.4 Locally unique addresses
In this scheme, the SSRC, CSRC and SDES options contain locally unique
identifiers of some length. For lengths of at least four bytes, it
is sufficient to have the application pick one at random, without local
coordination, with sufficiently low probability of collision within a single
host. The receiver creates a globally unique identifier by concatenating
the network address and one or more random identifiers. The synchronization
source is identified by the concatenation of the SSRC identifier and the
network address. Only translators are allowed to set the SSRC option. If a
translator receives an RTP packet which already contains an SSRC option, as
can occur if a packet traverses several translators, the translator has to
choose a new set of values, mapping packets with the same network source,
but different incoming SSRC value into different outgoing SSRC values. Note
that the SSRC constitutes a label-swapping scheme similar to that used for
ATM networks, except that the association setup is implicit. If a translator
loses state (say, after rebooting), the mapping is simply reestablished as
packets arrive from end systems or other translators. Until the receivers
time out, a single source may appear twice and there may be a temporary
confusion of sources and their descriptors.
The rules are:
end system: An end system never inserts CSRC options and typically does not
insert an SSRC option. An end system application may insert an SSRC
option if it originates more than one stream for a single conference
through a single network and transport address, e.g., a single UDP
port. The SDES option contains a zero for the identifier, indicating
that the receiver is to match on network address only. The receiver
determines the synchronization source as the concatenation of the network
source and the SSRC identifier.
bridge: A bridge assigns each source its own CSRC identifier (non-zero),
which is then used also in the SDES option.
translator: The translator maintains a list of all incoming sources, with
their network and SSRC, if present. Sources without SSRC are assigned
an SSRC equal to zero. Each of these sources is assigned a new local
identifier, which is then inserted into the SSRC option.
Local identifiers have advantages: the identifiers within the packet are
significantly shorter (four to six vs. at least ten bytes with padding), and
comparison of content and synchronization sources is quicker (integer
comparison vs. variable-length string comparison).
The identifiers are meaningless for debugging. In particular, it is
not easy for the receiver sitting behind a translator and a bridge to
determine where a bridge is located, unless the bridge identifies itself
periodically, possibly with another SDES-like option containing the actual
network address.
The major drawbacks appear to be the additional translator complexity:
translators need to maintain a mapping from incoming network/SSRC to
outgoing SSRC.
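The mapping a translator has to keep can be sketched as follows (the class
and method names are hypothetical, chosen only for illustration):

```python
import random

class TranslatorMap:
    """Label-swapping table: (network source, incoming SSRC) pairs are
    mapped to locally chosen, non-zero outgoing SSRC values; an
    incoming SSRC of 0 stands for 'no SSRC option present'."""

    def __init__(self):
        self._map = {}

    def outgoing_ssrc(self, net_src, incoming_ssrc=0):
        key = (net_src, incoming_ssrc)
        if key not in self._map:
            ssrc = random.getrandbits(32)
            while ssrc == 0 or ssrc in self._map.values():
                ssrc = random.getrandbits(32)   # avoid zero and collisions
            self._map[key] = ssrc
        return self._map[key]

t = TranslatorMap()
a = t.outgoing_ssrc("10.0.0.1")            # packet without an SSRC option
b = t.outgoing_ssrc("10.0.0.1", 0x1234)    # same host, different stream
assert a != b                              # distinct outgoing SSRCs
assert t.outgoing_ssrc("10.0.0.1") == a    # the mapping is stable
```

If the translator loses this table after a reboot, entries are simply
recreated as packets arrive, as described above.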
Note that using IP addresses as ``random'' local identifiers is not workable
if there is any possibility that two sources participating in the same
conference can coexist on the same host.
A somewhat contrived scenario is shown in Fig. 4.
Figure only available in PostScript version.
Figure 4: Complicated topology with translators (R) and bridges (G)
3.9 Energy Indication
G.764 contains a 4-bit noise energy field, which encodes the white noise
energy to be played by the receiver in the silences between talkspurts.
Playing silence periods as white noise reduces the noise-pumping where the
background noise audible during the talkspurt is audibly absent at the
receiver during silence periods. Substituting white noise for silence
periods at the receiver is not recommended for multi-party conferences, as
the summed background noise from all silent parties would be distracting.
Determining the proper noise level appears to be difficult. It is suggested
that the receiver simply takes the energy of the last packet received before
the beginning of a silence period as an indication of the background noise.
With this mechanism, an explicit indication in the packet header is not
required.
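The receiver-side estimate suggested above might look like this (a sketch;
linear-coded samples are assumed):

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of one packet of linear audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# The energy of the last packet received before a silence period sets
# the comfort-noise level played during the gap:
assert rms_energy([100, -100, 100, -100]) == 100.0
```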
3.10 Error Control
In principle, the receiver has four choices in handling packets with bit
errors [15]:
no checking: the receiver provides no indication whether a data packet
contains bit errors, either because a checksum is not present or is not
checked.
discard: the receiver discards errored packets, with no indication to the
application.
receive: the receiver delivers and flags errored packets to the
application.
correct: the receiver drops errored packets and requests retransmission.
It remains to be decided whether the header, the whole packet or neither
should be protected by checksums. NVP protects its header only, while G.764
has a single 16-bit check sequence covering both datalink and packet voice
header. However, if UDP is used as the transport protocol, a checksum over
the whole packet is already computed by the receiver. (Checksumming for UDP
can typically be disabled by the sending or receiving host, but usually not
on a per-port basis.) ST-II does not compute checksums for its payload.
Many data link protocols already discard packets with bit errors, so that
packets are rarely rejected due to higher-layer checksums.
Bit errors within the data part may be easier to tolerate than a lost
packet, particularly since some media encoding formats may provide built-in
error correction. The impact of bit errors within the header can vary; for
example, errors within the timestamp may cause the audio packet to be played
out at the wrong time, probably much more noticeable than discarding the
packet. Other noticeable effects are caused by a wrong flow or encoding
identifier. If a separate checksum is desired for the cases where the
underlying protocols do not already provide one, it should be optional.
Once optional, it would be easy to define several checksum options, covering
just the header, the header plus a certain part of the body or the whole
packet.
A checksum can also be used to detect whether the receiver has the correct
decryption key, avoiding noise or (worse) denial-of-service attacks. For
that application, the checksum should be computed across the whole packet,
before encrypting the content. Alternatively, a well-known signature could
be added to the packet and included in the encryption, as long as known
plaintext does not weaken the encryption security.
Embedding a checksum as an option may lead to undiscovered errors if
the presence of the checksum is masked by errors. This can occur
in a number of ways, for example by an altered option type field, a
final-option bit erroneously set in options prior to the checksum option or
an erroneous length field. Thus, it may be preferable to prefix
the RTP packet with a checksum as part of the specification of running
RTP over some network or transport protocol. To avoid the overhead of
including a checksum even in the common case where it is not needed, it
might be appropriate to distinguish two RTP protocol variations through the
next-protocol value in the lower-layer protocol header; the first would
include a checksum, the second would not. The checksum itself offers a
number of encoding possibilities(8):
o have two 16-bit checksums, one covering the header, the other the data
part
o combine a 16-bit checksum with a byte count indicating its coverage,
thus allowing either a header-only or a header-plus-data checksum
The latter has the advantage that the checksum can be computed without
determining the header length.
The error detection performance and computational cost of some common 16-bit
checksumming algorithms are summarized in Table 4. The implementations were
drawn from [16] and compiled on a SPARC IPX using the Sun ANSI C compiler
------------------------------
8. suggested by S. Casner
with optimization. The checksum computation was repeated 100 times;
thus, due to data cache effects, the execution times shown are probably
better than would be measured in an actual application. The relative
performance, however, should be similar. Among the algorithms, the CRC has
the strongest error detection properties, particularly for burst errors,
while the remaining algorithms are roughly equivalent [16]. The Fletcher
algorithm with modulo 255 (shown here) has the peculiar property that a
transformation of a byte from 0 to 255 remains undetected. CRC, the IP
checksum and Fletcher's algorithm cannot detect spurious zeroes at the end
of a variable-length message [17]. The non-CRC checksums have the advantage
that they can be updated incrementally if only a few bytes have changed.
The latter property is important for translators that insert synchronization
source indicators.
algorithm                                    ms
IP checksum                               0.093
Fletcher's algorithm, optimized [17]      0.192
CRC CCITT                                 0.310
Fletcher's algorithm, non-optimized [18]  2.044
Table 4: Execution time of common 16-bit checksumming algorithms, for a
1024-byte packet, in milliseconds
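The incremental-update property noted above (a non-CRC checksum can be patched when only a few bytes change, as a translator inserting identifiers must do) can be sketched for the IP checksum. This is a sketch with illustrative function names; the update rule is the standard one's-complement identity, not a quote from any specification:

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, folded and complemented."""
    if len(data) % 2:
        data += b"\x00"           # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return ~total & 0xFFFF

def update_checksum(old_ck: int, old_word: int, new_word: int) -> int:
    """Patch a checksum in place after one 16-bit word changes,
    without re-reading the rest of the packet."""
    total = (~old_ck & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A CRC, by contrast, would have to be recomputed over the entire covered region after such a change.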
3.11 Security and Privacy
3.11.1 Introduction
The discussion in this section is based on the work of the privacy
enhanced mail (PEM) working group within the Internet Engineering Task
Force, as documented in [19,20] and related documents. The reader is
referred to RFC 1113 [19] or its successors for terminology. Also relevant
is the work on security for SNMP Version 2. We discuss here how the
following security-related services may be implemented for packet voice and
video:
Confidentiality: Measures that ensure that only the intended receiver(s)
can decode the received audio/video data; for others, the data contains
no useful information.
Authentication: Measures that allow the receiver(s) to ascertain the
identity of the sender of data or to verify that the claimed originator
is indeed the originator of the data.
Message integrity: Measures that allow the receiver(s) to detect whether
the received data has been altered.
As for PEM [19], the following privacy-related concerns are not addressed at
this time:
o access control
o traffic flow confidentiality
o routing control
o assurance of data receipt and non-deniability of receipt
o duplicate detection, replay prevention, or other stream-oriented
services
These services either require connection-oriented services or support from
the lower layers that is currently unavailable. A reasonable goal is to
provide privacy at least equivalent to that provided by the public telephone
system (except for traffic flow confidentiality).
As for privacy-enhanced mail, the sender determines which privacy
enhancements are to be performed for a particular part of a data
transmission. Therefore, mechanisms should be provided that allow the
sender to determine whether the desired recipients are equipped to process
any privacy-enhancements. This is functionally similar to the negotiation
of, say, media encodings and should probably be handled by similar
mechanisms. It is anticipated that privacy-enhanced mail will be used
in the absence of or in addition to session establishment protocols and
agents to distribute keys or negotiate the enhancements to be used during a
conference.
3.11.2 Confidentiality
Only data encryption can provide confidentiality as long as intruders can
monitor the channel. It is desirable to specify an encryption algorithm and
provide implementations without export restrictions. Although DES is widely
available outside the United States, its use within software in both source
and binary form remains difficult.
We have the choice of either encrypting and/or authenticating the whole
packet or only the options and payload. Encrypting the fixed header denies
the intruder knowledge about some conference details (such as timing and
format) and protects against replay attacks. Encrypting the fixed header
also allows some heuristic detection of key mismatches, as the version
identifier, timestamp and other header information are somewhat predictable.
However, header encryption makes packet traces and debugging by external
programs difficult. Also, since translators may need to inspect and modify
the header, but do not have access to the sender's key, at least part of
the header needs to remain unencrypted, with the ability for the receiver
to discern which part has been encrypted. Given these complications and
the uncertain benefits of header encryption, it appears appropriate to limit
encryption to the options and payload part only.
In public key cryptography, the sender uses the receiver's public key for
encryption. Public key cryptography does not work for true multicast
systems since the public encoding key for every recipient differs, but it
may be appropriate when used in two-party conversations or application-level
multicast. In that case, mechanisms similar to privacy enhanced mail will
probably be appropriate. Key distribution for symmetric-key encryption such
as DES is beyond the scope of this recommendation, but the services of
privacy enhanced mail [19,21] may be appropriate.
For one-way applications, it may be desirable to prohibit listeners from
interrupting the broadcast. (After all, since live lectures on campus
get disrupted fairly often, there is reason to fear that a sufficiently
controversial lecture carried on the Internet could suffer a similar fate.)
Again, asymmetric encryption can be used. Here, the decryption key is
made available to all receivers, while the encryption key is known only
to the legitimate sender. Current public-key algorithms are probably too
computationally intensive for all but low-bit-rate voice. In most cases,
filtering based on sources will be sufficient.
3.11.3 Message Integrity and Authentication
The usual message digest methods are applicable if only the integrity of the
message is to be protected against tampering. Again, services similar to
that of privacy-enhanced mail [22] may be appropriate. The MD5 message
digest [23] appears suitable. It translates any size message into a 128-bit
(16-byte) signature. On a SPARCstation IPX (Sun 4/50), the computation
of a signature for a 180-byte audio packet takes approximately 0.378 ms(9).
Defining the signature to apply to all data beginning at the signature
option allows operation when translators change headers. The receiver has
to be able to locate the public key of the claimed sender. This poses two
problems: first, a way of identifying the sender unambiguously needs to be
found. The current methods of identification, such as the SMTP (e-mail)
address, are not unambiguous. Use of a distinguished name as described in
RFC 1255 [24] is suggested.
The authentication process is described in RFC 1422 [21]:
------------------------------
9. The processing rates for Sun 4/50 (40 MHz clock) and SPARCstation 10's
(36 MHz clock) are 0.95 and 2.2 MB/s, respectively, measured for a single
1000-byte block. Note that timing the repeated application of the algorithm
for the same block of data gives optimistic results since the data then
resides in the cache.
In order to provide message integrity and data origin
authentication, the originator generates a message integrity code
(MIC), signs (encrypts) the MIC using the private component of his
public-key pair, and includes the resulting value in the message
header in the MIC-Info field. The certificate of the originator
is (optionally) included in the header in the Certificate field
as described in RFC 1421. This is done in order to facilitate
validation in the absence of ubiquitous directory services. Upon
receipt of a privacy enhanced message, a recipient validates the
originator's certificate (using the IPRA public component as the
root of a certification path), checks to ensure that it has not
been revoked, extracts the public component from the certificate,
and uses that value to recover (decrypt) the MIC. The recovered
MIC is compared against the locally calculated MIC to verify the
integrity and data origin authenticity of the message.
For audio/video applications with loose control, the certificate could be
carried periodically to allow new listeners to obtain it and to achieve a
measure of reliability.
Symmetric key methods such as DES can also be used. Here, the key is
simply prefixed to the message when computing the message digest (MIC), but
not transmitted. The receiver has to obtain the sender's key through a
secure channel, e.g., a PEM message. The method has the advantage that no
cryptography is involved, thus alleviating export-control concerns. It is
used for SNMP Version 2 authentication.
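The shared-secret digest scheme described above might be sketched as follows (illustrative names; the key is prepended to the data before digesting and is itself never transmitted):

```python
import hashlib

def compute_mic(key: bytes, message: bytes) -> bytes:
    """MIC = MD5(secret key || message); only the MIC travels with the data."""
    return hashlib.md5(key + message).digest()

def verify_mic(key: bytes, message: bytes, mic: bytes) -> bool:
    """The receiver recomputes the digest with its copy of the key and compares."""
    return compute_mic(key, message) == mic
```

An intruder without the shared key cannot produce a matching MIC, yet no encryption of the data itself takes place.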
3.12 Security for RTP vs. PEM
It is the author's opinion that RTP should aim to reuse as much of the
PEM technology and syntax as possible, unless there are strong reasons in
the nature of real-time traffic to deviate. This has the advantage that
terminology, implementation experience, certificate mechanisms and possibly
code can be reused. Also, since it is hoped that RTP finds use in a range
of applications, a broad spectrum of security mechanisms should be provided,
not necessarily limited by what is appropriate for large-distribution audio
and video conferences.
It should be noted that connection-oriented security architectures are
probably unsuitable for RTP applications as they rely on reliable stream
transmission and an explicit setup phase with typically only a single sender
and receiver.
There are a number of differences between the security requirements of PEM
and RTP that should be kept in mind:
Transparency: Unlike electronic mail, it is safe to assume that the channel
will carry 8 bit data unaltered. Thus, a conversion to a canonical
form or encoding binary data into a 64-element subset as done for PEM
is not required.
Time: As outlined at the beginning of this document, processing speed and
packet overhead have to be major considerations, much more so than with
store-and-forward electronic mail. Message digest algorithms and DES
can be implemented sufficiently fast even in software to be used for
voice and possibly for low-bit rate video. Even for short signatures,
RSA encryption is fairly slow.
Note that the ASN.1/BER encoding of asymmetrically-encrypted MICs and
certificates adds no significant processing load. For the MICs, the
ASN.1 algorithm yields only additional constant bytes which a paranoid
program can check, but does not need to decode. Certificates are
carried much more infrequently and are relatively simple structures.
It would seem unnecessary to supply a complete ASN.1/BER parser for any
of the datastructures.
Space: Encryption algorithms require a minimum data input equal to their
keylength. Thus, for the suggested key length for RSA encryption
of 508 to 1024 bits, the 16-byte message digest expands to a 53
to 128 byte MIC. This is clearly rather burdensome for short audio
packets. Applying a single message digest to several packets seems
possible if the packet loss rates are sufficiently low, even though it
does introduce minor security risks in the case where the receiver is
forced to decide between accepting as authentic an incomplete sequence
of packets or rejecting the whole sequence. Note that it would
not be necessary to wait with playback until a complete authenticated
block has been received; in general, a warning that authentication has
failed would be sufficient for human users. The application should
also issue a warning if no complete block could be authenticated for
several blocks, as that might indicate that an impostor was feigning
the presence of MIC-protected data by strategically dropping packets.
The initialization vector for DES in cipher block mode adds another
eight bytes.
Scale: The symmetric key authentication algorithm used by PEM does not
scale well for a large number of receivers as the message has to
contain a separate MIC for each receiver, encrypted with the key for
that particular sender-receiver pair. If we forgo the ability to
authenticate an individual user, a single session key shared by all
participants can thwart impostors from outside the group holding the
shared secret.
3.13 Quality of Service Control
Because real-time services cannot afford retransmissions, they are directly
affected by packet loss and delays. Delay jitter and packet loss, for
example, provide a good indication of network congestion and may suggest
switching to a lower-bandwidth coding. To aid in fault isolation and
performance monitoring, quality-of-service (QOS) measurement support is
useful. QOS monitoring benefits the receiver of real-time
data, the sender of that data, and possibly a third-party monitor, e.g.,
the network provider, that is itself not part of the real-time data
distribution.
3.13.1 QOS Measures
For real-time services, a number of QOS measures are of interest, roughly in
order of importance:
o packet loss
o packet delay variation (variance, minimum/maximum)
o relative clock drift (delay between sender and receiver timestamp)
In the following, the terms receiver and sender pertain to the real-time
data, not any returned QOS data. If the receiver is to measure packet loss,
an indication of the number of packets actually transmitted is required.
If the receiver itself does not need to compute packet loss percentages,
it is sufficient for the receiver to indicate to the sender the number of
packets received and the range of timestamps covered, thus avoiding the need
for sequence numbers. Translation into loss at the sender is somewhat
complicated, however, unless restrictions on permissible timestamps (e.g.,
those starting a synchronization unit) are enforced. If sequence numbers
are available, the receiver has to track the number of times that the
sequence number has wrapped around, even in the face of packet reordering.
If c_n denotes the cycle count, M the sequence number modulus and s_n the
sequence number of the n-th received packet, where s_n is not necessarily
larger than s_(n-1), we can write:
    c_n = c_(n-1) + 1   for  -M < s_n - s_(n-1) < -M/2
    c_n = c_(n-1) - 1   for  M/2 < s_n - s_(n-1) < M
    c_n = c_(n-1)       otherwise
For example, the sequence number sequence 65534, 2, 65535, 1, 3, 5, 4 would
yield the cycle number sequence 0, 1, 0, 1, 1, 1, 1 for M = 65536, i.e.,
16-bit sequence numbers. The total number of expected packets is then
computed simply as s_n + M*c_n - s_0 + 1, where the first received packet
has index 0.
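The cycle-count rule and the expected-packet computation can be sketched as follows (a sketch assuming 16-bit sequence numbers; function names are illustrative):

```python
M = 1 << 16  # sequence number modulus for 16-bit sequence numbers

def cycle_counts(seqs):
    """Track wraparound cycles even in the face of packet reordering."""
    cycles = [0]
    for prev, cur in zip(seqs, seqs[1:]):
        diff = cur - prev
        if -M < diff < -M // 2:
            cycles.append(cycles[-1] + 1)   # sequence number wrapped forward
        elif M // 2 < diff < M:
            cycles.append(cycles[-1] - 1)   # reordered packet from before a wrap
        else:
            cycles.append(cycles[-1])
    return cycles

def expected_packets(seqs, cycles):
    """Total expected packets: s_n + M*c_n - s_0 + 1, first packet at index 0."""
    return seqs[-1] + M * cycles[-1] - seqs[0] + 1
```

Applied to the example sequence above, the function reproduces the cycle numbers 0, 1, 0, 1, 1, 1, 1 and an expected count of seven packets.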
The user of the measurements should also have some indication as to the time
period they cover so that the degree of confidence in these statistical
measurements can be established.
3.13.2 Remote measurements
It may be desirable for the sender, interested multicast group members
or a non-group member (third party) to have automatic access to
quality-of-service measurements. In particular, it is necessary for the
sender to gather a number of reception reports from different parts of the
Internet to ``triangulate'' where packets get lost or delayed.
Two modes of operation can be distinguished: monitor-driven or
receiver-driven. In the monitor-driven case, a site interested in QOS data
for a particular sender contacts the receiver through a back channel and
requests a reception report. Alternatively, each site can send reception
reports to a monitoring multicast group or as session data, along with
the ``regular station identification'' to the same multicast group used
for data. The first approach requires the most implementation effort,
but produces the least amount of data. The other two approaches have
complementary properties.
In most cases, sender-specific quality of service information is more useful
for tracking network problems than aggregate data for all senders. Since
a site cannot transmit reception reports for all senders it has ever heard
from, some selection mechanism is needed, such as most-recently-heard or
cycling through sites.
Source identification poses some difficulties since the network address seen
by the receiver may not be meaningful to other members of the multicast
group, e.g., after IP-SIP address translation. On the other hand, network
addresses are easier to correlate with other network-level tools such as
those used for Mbone mapping.
One candidate measure is the minimum and maximum difference between
departure and arrival timestamp.
This has the advantage that the fixed delay can also be estimated if
sender and receiver clocks are known to be synchronized. Unfortunately,
delay extrema are noisy measurements that give only limited indication of
the delay variability. The receiver could also return the playout delay
value it uses, although for absolute timing, that again depends on the
clock differential, as well as on the particular delay estimation algorithm
employed by the receiver. In summary, a minimal set of useful measurements
appears to be the expected and received packet count, combined with the
minimum and maximum timestamp difference.
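That minimal measurement set might be collected per sender roughly as in this sketch (illustrative names; sender and receiver timestamps are assumed to share one unit, and the extrema include the unknown clock offset):

```python
class ReceptionStats:
    """Per-sender reception statistics: packet count plus timestamp extrema."""

    def __init__(self):
        self.received = 0
        self.min_diff = None   # minimum (arrival - departure) difference
        self.max_diff = None   # maximum; both include the clock offset

    def record(self, departure_ts, arrival_ts):
        diff = arrival_ts - departure_ts
        self.received += 1
        if self.min_diff is None or diff < self.min_diff:
            self.min_diff = diff
        if self.max_diff is None or diff > self.max_diff:
            self.max_diff = diff

    def report(self, expected):
        """Reception report: expected vs. received count and delay extrema."""
        return {"expected": expected, "received": self.received,
                "min_diff": self.min_diff, "max_diff": self.max_diff}
```

The difference between max_diff and min_diff bounds the delay variation even when the clocks are unsynchronized, since the constant offset cancels.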
3.13.3 Monitoring by Third Party
Except for delay estimates based on sequence number ranges, the above
section applies for this case as well.
4 Conference Control Protocol
Currently, only conference control functions used for loosely controlled
conferences (open admission, no explicit conference set-up) have been
considered in depth. Support for the following functionality needs to be
specified:
o authentication
o floor control, token passing
o invitations, calls
o call forwarding, call transfer
o discovery of conferences and resources (directory service)
o media, encoding and quality-of-service negotiation
o voting
o conference scheduling
o user locator
The functional specification of a conference control protocol is beyond the
scope of this memorandum.
5 The Use of Profiles
RTP is intended to be a rather 'thin' protocol, partially because it aims
to serve a wide variety of real-time services. The RTP specification
intentionally leaves a number of issues open for other documents (profiles),
which in turn have the goal of making it easy to build interoperable
applications for a particular application domain, for example, audio and
video conferences.
Some of the issues that a profile should address include:
o the interpretation of the 'content' field with the CDESC option
o the structure of the content-specific part at the end of the CDESC
option
o the mechanism by which applications learn about and define the mapping
between the 'content' field in the RTP fixed header and its meaning
o the use of the optional framing field prefixed to RTP packets (not
used, used only if underlying transport protocol does not provide
framing, used by some negotiation mechanism, always used)
o any RTP-over-x issues, that is, definitions needed to allow RTP to use
a particular underlying protocol
o content-specific RTP, RTCP or reverse control options
o port assignments for data and reverse control
6 Port Assignment
Since it is anticipated that UDP and similar port-oriented protocols will
play a major role in carrying RTP traffic, the issue of port assignment
needs to be addressed. The way ports are assigned mainly affects how
applications can extract the packets destined for them. For each medium,
there also needs to be a mechanism for distinguishing data from control
packets.
For unicast UDP, only the port number is available for demultiplexing.
Thus, each medium will need a separate port number pair unless a separate
demultiplexing agent is used. However, for one-to-one connections,
dynamically negotiating a port number is easy. If several UDP streams are
used to provide multicast by transport-level replication, the port number
issue becomes somewhat more difficult. For ST-II, a common port number has
to be agreed upon by all participants, which may be difficult particularly
if a new site wants to join an on-going connection, but is already using the
port number in a different connection.
For UDP multicast, an application can select to receive only packets with a
particular port number and multicast address by binding to the appropriate
multicast address(10). Thus, for UDP multicast, there is no need to
distinguish media by port numbers, as each medium could have its designated
and unique multicast group. Any dynamic port allocation mechanism would
fail for large, dynamic multicast groups, but might be appropriate for small
------------------------------
10. This extension to the original multicast socket semantics is currently
in the process of being deployed.
conferences and two-party conversations.
Data and control packets for a single medium can either share a single
port or use two different port numbers. (Currently, two adjacent port
numbers, 3456 and 3457, are used.) A single port for data and control
simplifies the receiver code and translators and, less important, conserves
port numbers. With the proliferation of firewalls, limiting the number of
ports has assumed additional importance. Sharing a single port requires
some other means of identifying control packets, for example as a special
encoding code. Alternatively, all control data could be carried as options
within data packets, akin to the NVP protocol options. Since control
messages are also transmitted if no actual medium data is available, header
content of packets without media data needs to be determined. With the use
of a synchronization bit, the issue of how sequence numbers and timestamps
are to be treated for these packets is less critical. It is suggested to
use a zero timestamp and to increment the sequence number normally. Due to
the low bandwidth requirements of typical control information, the issue of
accommodating control information in any bandwidth reservation scheme should
be manageable. The penalty paid is the eight-byte overhead of the RTP
header for control packets that do not require time stamps, encoding and
sequence number information.
Using a single RTCP stream for several media may be advantageous to
avoid duplicating, for example, the same identification information for
voice, video and whiteboard streams. This works only if there is one
multicast group that all members of a conference subscribe to. Given
the relatively low frequency of control messages, the coordination effort
between applications and the necessity to designate control messages for a
particular medium are probably reasons enough to have each application send
control messages to the same multicast group as the data.
In conclusion, for multicast UDP, one assigned port number, for both data
and control, seems to offer the most advantages, although the data/control
split may offer some bandwidth savings.
7 Multicast Address Allocation
A fixed, permanent allocation of network multicast addresses to individual
conferences by some naming authority such as the Internet Assigned Numbers
Authority is clearly not feasible, since the lifetime of conferences is
unknown, the potential number of conferences is rather large and the
available number space limited to about 2^28, of which 2^16 have been set
aside for dynamic allocation by conferences.
The alternative to permanent allocation is a dynamic allocation, where an
initiator of a multicast application obtains an unused multicast address in
some manner (discussed below). The address is then made available again,
either implicitly or explicitly, as the application terminates.
The address allocation may or may not be handled by the same mechanism that
provides conference naming and discovery services. Separating the two has
the advantage that dynamic (multicast) address allocation may be useful
to applications other than conferencing. Also, different mechanisms (for
example, periodic announcements vs. servers) may be appropriate for each.
We can distinguish two methods of multicast address assignment:
function-based: all applications of a certain type share a common, global
address space. Currently, a reservation of a 16-bit address space for
conferences is one example. The advantage of this scheme is that
directory functions and allocation can be readily combined, as is done
in the sd tool by Van Jacobson. A single namespace spanning the
globe makes it necessary to restrict the scope of addresses so that
allocation does not require knowing about and distributing information
about the existence of all global conferences.
hierarchical: Based on the location of the initiator, only a subset of
addresses are available. This limits the number of hosts that
could be involved in resolving collisions, but, like most hierarchical
assignment, leads to sparse allocation. Allocation is independent of
the function the address is used for.
Clearly, combinations are possible, for example, each local namespace could
be functionally divided if sufficiently large. With the current allocation
of 2^16 addresses to conferences, hierarchical division except on a very
coarse scale is not feasible.
To a limited extent, multicast address allocation can be compared to the
well-known channel multiple access problem. The multicast address space
plays the role of the common channel, with each address representing a time
slot.
All the following schemes require cooperation from all potential users of
the address space. There is no protection against an ignorant or malicious
user joining a multicast group.
7.1 Channel Sensing
In this approach, the initiator randomly selects a multicast address from a
given range, joins the multicast group with that address and listens whether
some other host is already transmitting on that address. This approach does
not require a separate address allocation protocol or an address server,
but it is probably infeasible for a number of reasons. First, a user
process can only bind to a single port at one time, making 'channel sensing'
difficult. Secondly, unlike listening to a typical broadcast channel, the
act of joining the multicast group can be quite expensive both for the
listening host and the network. Consider what would happen if a host
attached through a low-bandwidth connection joins a multicast group carrying
video traffic, say.
Channel sensing may also fail if two sections of the network that were
separated at the time of address allocation rejoin later. Changes in
time-to-live values can make multicast groups 'visible' to hosts that
previously were outside their scope.
7.2 Global Reservation Channel with Scoping
Each range of multicast addresses has an associated well-known multicast
address and port where all initiators (and possibly users) advertise the use
of multicast addresses. An initiator first picks a multicast address at
random, avoiding those already known to be in use. Some mechanism for
collision resolution has to be provided in the unlikely event that two
initiators simultaneously choose the same address. Also, since address
advertisement will have to be sent at fairly long intervals to keep traffic
down, an application wanting to start a conference, for example, has to
wait for an extended period of time unless it continuously monitors the
allocation multicast group.
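The pick-then-advertise step might look like this sketch (illustrative; addresses are modeled as plain integers and `in_use` is the set learned from the reservation channel):

```python
import random

def pick_multicast_address(in_use, base, span):
    """Randomly choose an address in [base, base + span) not known to be in use.

    A collision with a simultaneous chooser remains possible and must be
    resolved by the advertisement protocol.
    """
    free = span - len(set(in_use) & set(range(base, base + span)))
    if free == 0:
        raise RuntimeError("address range exhausted")
    while True:
        addr = base + random.randrange(span)
        if addr not in in_use:
            return addr
```

With a lightly used range the loop terminates quickly; as the range fills, the expected number of retries grows, which is one more argument for scoped reuse of addresses.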
To limit traffic, it may seem advisable to only have the initiator multicast
the address usage advertisement. This, however, means that there needs to
be a mechanism for another site to take over advertising the group if the
initiator leaves, but the multicast group continues to exist. Time-to-live
restrictions pose another problem. If only a single source advertises the
group, the advertisement may not reach all those sites that could be reached
by the multicast transmissions themselves.
The possibility of collisions can be reduced by address reuse with scoping,
discussed further below, and by adding port numbers and other identifiers
as further discriminators. The latter approach appears to defeat the
purpose of using multicast to avoid transmitting information to hosts that
have no interest in receiving it. Routers can only filter based on group
membership, not ports or other higher-layer demultiplexing identifiers.
Thus, even though two conferences with the same multicast address and
different ports, say, could coexist at the application layer, this would
force hosts and networks that are interested in only one of the conferences
to deal with the combined traffic of the two conferences.
7.3 Local Reservation Channel
Instead of sharing a global namespace for each application, this scheme
divides the multicast address space hierarchically, allowing an initiator
within a given network to choose from a smaller set of multicast addresses,
but independent of the application. As with many allocation problems, we
can devise both server-based and fully distributed versions.
7.3.1 Hierarchical Allocation with Servers
By some external means, address servers, distributed throughout the network,
are provided with non-overlapping regions of the multicast address space.
An initiator asks its favorite address server for an address when needed.
When it no longer needs the address, it returns it to the server. To
prevent addresses from disappearing when the requestor crashes and loses
its memory about allocated addresses, requests should have an associated
time-out period. This would also (to some extent) cover the case that the
initiator leaves the conference, without the conference itself disbanding.
To decrease the chances that an initiator cannot be provided with an
address, either the local server could 'borrow' an address from another
server or could point the initiator to another server, somewhat akin to the
methods used by the Domain Name Service (DNS). Provisions have to be made
for servers that crash and may lose knowledge about the status of their block
of addresses, in particular their expiration times. The impact of such
failures could be mitigated by limiting the maximum expiration time to a few
hours. Also, the server could try to request status by multicast from its
clients.
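The lease mechanism sketched above can be illustrated in a few lines; the
class and method names below are hypothetical and not part of any protocol
specification.

```python
import time

class AddressServer:
    """Hands out multicast addresses from a pre-assigned block, with
    time-outs so that addresses leased to crashed requestors are
    eventually reclaimed (a sketch of the scheme described above)."""

    def __init__(self, block, lease_seconds=3600):
        self.free = set(block)        # addresses not currently leased
        self.leases = {}              # address -> expiration time
        self.lease_seconds = lease_seconds

    def _expire(self, now):
        # Reclaim addresses whose requestors crashed or never renewed.
        for addr, expires in list(self.leases.items()):
            if expires <= now:
                del self.leases[addr]
                self.free.add(addr)

    def allocate(self, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        if not self.free:
            return None               # caller could try another server
        addr = self.free.pop()
        self.leases[addr] = now + self.lease_seconds
        return addr

    def release(self, addr):
        if addr in self.leases:
            del self.leases[addr]
            self.free.add(addr)
```

A real server would also have to survive its own crashes, e.g., by keeping
the lease table in stable storage or polling clients by multicast as noted
above.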
7.3.2 Distributed Hierarchical Allocation
Instead of a server, each network is allocated a set of multicast
addresses. Within the current IP address space, class A, B, and C
networks would get roughly 120 addresses, taking into account those that
have been permanently assigned. Contention for addresses works like the
global reservation channel discussed earlier, but the reservation group is
strictly limited to the local network. (Since the address ranges are
disjoint, address information that inadvertently leaks outside the network,
is harmless.)
This method avoids the use of servers and the attendant failure modes, but
introduces other problems. The division of the address space leads to a
barely adequate supply of addresses (although larger address formats will
probably make that less of an issue in the future). As for any distributed
algorithm, splitting of networks into temporarily unconnected parts can
easily destroy the uniqueness of addresses. Handling initiators that leave
on-going conferences is probably the most difficult issue.
7.4 Restricting Scope by Limiting Time-to-Live
Regardless of the address allocation method, it may be desirable to
distinguish multicast addresses with different reach. A local address would
be given out with the restriction of a maximum time-to-live value and could
thus be reused at a network sufficiently removed, akin to the combination
of cell reuse and power limitation in cellular telephony. Given that many
conferences will be local or regional (e.g., broadcasting classes to nearby
campuses of the same university or a regional group of universities, or an
electronic town meeting), this should allow significant reuse of addresses.
Reuse of addresses requires careful engineering of thresholds and would
probably only be useful for very small time-to-live values that restrict
reach to a single local area network. Using time-to-live fields to restrict
scope rather than just prevent looping introduces difficult-to-diagnose
failure modes into multicast sessions. In particular, reachability is no
longer transitive, as B may have A and C in its scope, but A and B may be
outside each other's scope (or A may be in the scope of B, but not vice
versa, due to asymmetric routes, etc.). This problem is aggravated by the
fact that routers (for obvious reasons) are not supposed to return ICMP time
exceeded messages, so that the sender can only guess why multicast packets
do not reach certain receivers.
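As an illustration of the mechanism (not of any particular conferencing
tool), the standard sockets API lets a sender bound the scope of its
multicast packets by setting the outgoing time-to-live; the group address
and port below are examples only.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# A TTL of 1 confines the session to the local network; larger values
# widen the scope but, as noted above, make reachability non-transitive.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL)
# Media packets would then be sent with, e.g.,
#   sock.sendto(media_packet, ("224.2.0.1", 5004))   # example address/port
sock.close()
```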
8 Security Considerations
Security issues are discussed in Section 3.11.
Acknowledgments
This draft is based on discussion within the AVT working group chaired by
Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments.
This work was supported in part by the Office of Naval Research under
contract N00014-90-J-1293, the Defense Advanced Research Projects Agency
under contract NAG2-578 and a National Science Foundation equipment grant,
CERDCR 8500332.
A Glossary
The glossary below briefly defines the acronyms used within the text.
Further definitions can be found in RFC 1392, ``Internet User's Glossary''.
Some of the general Internet definitions below are copied from that
glossary. The quoted passages followed by a reference of the form
``(G.701)'' are drawn from the CCITT Blue Book, Fascicle I.3, Definitions.
The glossary of the document ``Recommended Practices for Enhancing Digital
Audio Compatibility in Multimedia Systems'', published by the Interactive
Multimedia Association was used for some terms marked with [IMA]. The
section on MPEG is based on text written by Mark Adler (Caltech).
4:1:1 Refers to degree of subsampling of the two chrominance signals with
respect to the luminance signal. Here, each color difference component
has one quarter the resolution of the luminance component.
4:2:2 Refers to degree of subsampling of the two chrominance signals with
respect to the luminance signal. Here, each color difference component
has half the resolution of the luminance component.
16/16 timestamp: a 32-bit integer timestamp consisting of a 16-bit field
containing the number of seconds followed by a 16-bit field containing
the binary fraction of a second. This timestamp can measure about 18.2
hours with a resolution of approximately 15 microseconds.
n/m timestamp: an (n+m)-bit timestamp consisting of an n-bit second count
and an m-bit fraction.
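For concreteness, a 16/16 timestamp can be packed and unpacked as follows
(the helper names are illustrative only):

```python
def pack_16_16(seconds):
    """Encode a time value (in seconds) as a 32-bit 16/16 fixed-point
    word: 16-bit second count, 16-bit binary fraction."""
    return int(round(seconds * 65536)) & 0xFFFFFFFF  # wraps every 2**16 s

def unpack_16_16(word):
    """Decode a 32-bit 16/16 fixed-point word back to seconds."""
    return (word >> 16) + (word & 0xFFFF) / 65536.0

# The second field wraps after 2**16 s (about 18.2 hours) and the
# fraction resolves about 15 microseconds (2**-16 s), as stated above.
```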
ADPCM: Adaptive differential pulse code modulation. Rather than
transmitting ! PCM samples directly, the difference between the
estimate of the next sample and the actual sample is transmitted. This
difference is usually small and can thus be encoded in fewer bits than
the sample itself. The ! CCITT recommendations G.721, G.723, G.726
and G.727 describe ADPCM encodings. ``A form of differential pulse
code modulation that uses adaptive quantizing. The predictor may be
either fixed (time invariant) or variable. When the predictor is
adaptive, the adaptation of its coefficients is made from the quantized
difference signal.'' (G.701)
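The differential idea can be illustrated with a toy fixed-predictor DPCM
coder; real ADPCM (G.721/G.726) additionally adapts the quantizer step size
and the predictor coefficients.

```python
def dpcm_encode(samples):
    """Transmit the difference between each sample and a trivial
    predictor (the previous sample) instead of the sample itself."""
    prev, out = 0, []
    for s in samples:
        out.append(s - prev)   # usually a small value, needing fewer bits
        prev = s
    return out

def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the differences."""
    prev, out = 0, []
    for d in diffs:
        prev += d
        out.append(prev)
    return out
```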
adaptive quantizing: ``Quantizing in which some parameters are made
variable according to the short term statistical characteristics of the
quantized signal.'' (G.701)
A-law: a type of audio !companding popular in Europe.
CCIR: Comite Consultatif International des Radiocommunications. This
organization is part of the United Nations International
Telecommunications Union (ITU) and is responsible for making technical
recommendations about radio, television and frequency assignments. The
CCIR has recently changed its name to ITU-R; we maintain the more
familiar name. !CCITT
CCIR-601: The CCIR-601 digital television standard is the base for all the
subsampled interchange formats such as SIF, CIF, QCIF, etc. For NTSC
(PAL/SECAM), it is 720 (720) pixels by 243 (288) lines by 60 (50)
fields per second, where the fields are interlaced when displayed.
The chrominance channels are horizontally subsampled by a factor of two,
yielding 360 (360) pixels by 243 (288) lines by 60 (50) fields a
second.
CCITT: Comite Consultatif International Telegraphique et Telephonique
(CCITT). This organization is part of the United Nations International
Telecommunications Union (ITU) and is responsible for making technical
recommendations about telephone and data communications systems. X.25
is an example of a CCITT recommendation. Every four years CCITT holds
plenary sessions where they adopt new recommendations. Recommendations
are known by the color of the cover of the book they are contained in.
(The 1988 edition is known as the Blue Book.) The CCITT has recently
changed its name to ITU-TS; we maintain the familiar name. !CCIR
CELP: code-excited linear prediction; audio encoding method for low-bit
rate codecs; !LPC.
CD: compact disc.
chrominance: color information in a video image. For !H.261, color is
encoded as two color differences: CR (``red'') and CB (``blue'').
!luminance
CIF: common interchange format; interchange format for video images with
288 lines with 352 pixels per line of luminance and 144 lines with 176
pixels per line of chrominance information. !QCIF, SCIF
CLNP: ISO connectionless network-layer protocol (ISO 8473), similar in
functionality to !IP.
codec: short for coder/decoder; device or software that ! encodes and
decodes audio or video information.
companding: contraction of compressing and expanding; reducing the dynamic
range of audio or video by a non-linear transformation of the sample
values. The best known methods for audio are mu-law, used in North
America, and A-law, used in Europe and Asia. !G.711 For a given
number of bits, companded data uses a greater number of binary codes to
represent small signal levels than linear data, resulting in a greater
dynamic range at the expense of a poorer signal-to-noise ratio. [25]
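The continuous mu-law characteristic illustrates the effect; note that
G.711 itself specifies a segmented 8-bit approximation of this curve rather
than the formula directly.

```python
import math

MU = 255.0  # the mu parameter used in North American telephony

def mu_law_compress(x):
    """Continuous mu-law curve for x in [-1, 1]."""
    sign = -1.0 if x < 0 else 1.0
    return sign * math.log(1.0 + MU * abs(x)) / math.log(1.0 + MU)

def mu_law_expand(y):
    """Inverse of the curve above."""
    sign = -1.0 if y < 0 else 1.0
    return sign * ((1.0 + MU) ** abs(y) - 1.0) / MU

# Small signals get disproportionately many output codes:
# mu_law_compress(0.01) is about 0.23, not 0.01.
```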
DAT: digital audio tape.
decimation: reduction of sample rate by removal of samples [IMA].
delay jitter: Delay jitter is the variation in end-to-end network delay,
caused principally by varying media access delays, e.g., in an
Ethernet, and queueing delays. Delay jitter needs to be compensated
by adding a variable delay (referred to as ! playout delay) at the
receiver.
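One common (but by no means mandated) way to choose the playout delay is to
track exponentially weighted estimates of the network delay and its
variation, and to play out a few deviations beyond the mean delay; the
class below is a sketch, with an illustrative smoothing constant.

```python
class PlayoutEstimator:
    """Adaptive playout delay from per-packet delay measurements."""

    def __init__(self, alpha=0.998):
        self.alpha = alpha     # smoothing constant (illustrative value)
        self.delay = None      # smoothed one-way delay estimate
        self.var = 0.0         # smoothed delay variation (jitter)

    def update(self, send_time, recv_time):
        d = recv_time - send_time
        if self.delay is None:
            self.delay = d
        else:
            self.var = (self.alpha * self.var
                        + (1 - self.alpha) * abs(d - self.delay))
            self.delay = self.alpha * self.delay + (1 - self.alpha) * d
        return self.delay + 4 * self.var   # target playout delay
```

Note that send and receive clocks need not be synchronized; only the
variation of the measured delay matters for jitter compensation.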
DVI: (trademark) digital video interactive. Audio/video compression
technology developed by Intel's DVI group. [IMA]
dynamic range: a ratio of the largest encodable audio signal to the
smallest encodable signal, expressed in decibels. For linear audio
data types, the dynamic range is approximately six times the number of
bits, measured in dB.
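The six-dB-per-bit rule follows directly from 20 log10(2) being
approximately 6.02:

```python
import math

def dynamic_range_db(bits):
    """Dynamic range of linear coding with the given word length: the
    ratio of largest to smallest encodable level is 2**bits, i.e.,
    20 * log10(2**bits) ~= 6.02 dB per bit."""
    return 20 * math.log10(2 ** bits)

# 16-bit linear audio thus has roughly 96 dB of dynamic range.
```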
encoding: transformation of the media content for transmission, usually to
save bandwidth, but also to decrease the effect of transmission errors.
Well-known encodings are G.711 (mu-law PCM), and ADPCM for audio, JPEG
and MPEG for video. ! encryption
encryption: transformation of the media content to ensure that only the
intended recipients can make use of the information. ! encoding
end system: host where conference participants are located. RTP packets
received by an end system are played out, but not forwarded to other
hosts (in a manner visible to RTP).
FIR: finite (duration) impulse response. A signal processing filter that
does not use any feedback components [IMA].
frame: unit of information. Commonly used for video to refer to a single
picture. For audio, it refers to the data that forms an encoding unit.
For example, an LPC frame consists of the coefficients necessary to
generate a specific number of audio samples.
frequency response: a system's ability to encode the spectral content of
audio data. The sample rate has to be at least twice as large as the
maximum possible signal frequency.
G.711: ! CCITT recommendation for ! PCM audio encoding at 64 kb/s using
mu-law or A-law companding.
G.721: ! CCITT recommendation for 32 kbit/s adaptive differential pulse
code modulation (! ADPCM, PCM).
G.722: ! CCITT recommendation for audio coding at 64 kbit/s; the audio
bandwidth is 7 kHz instead of 3.5 kHz for G.711, G.721, G.723 and
G.728.
G.723: ! CCITT recommendation for extensions of Recommendation G.721
adapted to 24 and 40 kbit/s for digital circuit multiplication
equipment.
G.728: ! CCITT recommendation for voice coding using code-excited linear
prediction (CELP) at 16 kbit/s.
G.764: ! CCITT recommendation for packet voice; specifies both ! HDLC-like
data link and network layer. In the draft stage, this standard was
referred to as G.PVNP. The standard is primarily geared towards digital
circuit multiplication equipment used by telephone companies to carry
more voice calls on transoceanic links.
G.821: ! CCITT recommendation for the error performance of an
international digital connection forming part of an integrated services
digital network.
G.822: ! CCITT recommendation for the controlled !slip rate objective on
an international digital connection.
G.PVNP: designation of CCITT recommendation ! G.764 while in draft status.
GOB: (H.261) groups of blocks; a !CIF picture is divided into 12 GOBs, a
QCIF into 3 GOBs. A GOB is composed of 33 macroblocks (!MB) and
contains luminance and chrominance information for 8448 pixels.
GSM: Groupe Special Mobile. In general, designation for European mobile
telephony standard. In particular, often used to denote the audio
coding used. Formally known as the European GSM 06.10 provisional
standard for full-rate speech transcoding, prI-ETS 300 036. It uses
RPE/LTP (residual pulse excitation/long term prediction) at 13 kb/s
using frames of 160 samples covering 20 ms.
H.261: ! CCITT recommendation for the compression of motion video at rates
of p x 64 kb/s (where p = 1, ..., 30). Originally intended for narrowband
!ISDN.
hangover: [26] Audio data transmitted after the silence detector indicates
that no audio data is present. Hangover ensures that the ends of
words, important for comprehension, are transmitted even though they
are often of low energy.
HDLC: high-level data link control; standard data link layer protocol
(closely related to LAPD and SDLC).
IMA: Interactive Multimedia Association; trade association located in
Annapolis, MD.
ICMP: Internet Control Message Protocol; ICMP is an extension to the
Internet Protocol. It allows for the generation of error messages,
test packets and informational messages related to ! IP.
in-band: signaling information is carried together (in the same channel or
packet) with the actual data. ! out-of-band.
interpolation: increase in sample rate by introduction of processed
samples.
IP: internet protocol; the Internet Protocol, defined in RFC 791, is the
network layer for the TCP/IP Protocol Suite. It is a connectionless,
best-effort packet switching protocol [27].
IP address: four-byte binary host interface identifier used by !IP for
addressing. An IP address consists of a network portion and a
host portion. RTP treats IP addresses as globally unique, opaque
identifiers.
IPv4: current version (4) of ! IP.
ISDN: integrated services digital network; refers to an end-to-end circuit
switched digital network intended to replace the current telephone
network. ISDN offers circuit-switched bandwidth in multiples of 64
kb/s (B or bearer channel), plus a 16 kb/s packet-switched data (D)
channel.
ISO: International Standards Organization. A voluntary, nontreaty
organization founded in 1946. Its members are the national
standards organizations of the 89 member countries, including ANSI
for the U.S. (Tanenbaum)
ISO 10646: !ISO standard for the encoding of characters from all languages
into a single 32-bit code space (Universal Character Set). For
transmission and storage, a one-to-five octet code (UTF) has been
defined which is upwardly compatible with US-ASCII.
JPEG: ISO/CCITT joint photographic experts group. Designation of a
variable-rate compression algorithm using discrete cosine transforms
for still-frame color images.
jitter: ! delay jitter.
linear encoding: a mapping from signal values to binary codes where each
binary level represents the same signal increment !companding.
loosely controlled conference: Participants can join and leave the
conference without connection establishment or notifying a conference
moderator. The identity of conference participants may or may not be
known to other participants. See also: tightly controlled conference.
low-pass filter: a signal processing function that removes spectral content
above a cutoff frequency. [IMA]
LPC: linear predictive coder. Audio encoding method that models speech as
the parameters of a linear filter; used for very low bit rate codecs.
luminance: brightness information in a video image. For black-and-
white (grayscale) images, only luminance information is required.
!chrominance
MB: (H.261) macroblock, consisting of six blocks, four eight-by-eight
luminance blocks and two chrominance blocks.
MPEG: ISO/CCITT motion picture experts group JTC1/SC29/WG11. Designates a
variable-rate compression algorithm for full motion video at low bit
rates; uses both intraframe and interframe coding. It defines a bit
stream for compressed video and audio optimized to fit into a bandwidth
(data rate) of 1.5 Mbits/s. This rate is special because it is the
data rate of (uncompressed) audio CDs and DATs. The draft is in
three parts, video, audio, and systems, where the last part gives the
integration of the audio and video streams with the proper timestamping
to allow synchronization of the two. MPEG phase II is to define a
bitstream for video and audio coded at around 3 to 10 Mbits/s.
MPEG compresses YUV SIF images. Motion is predicted from frame to
frame, while DCTs of the difference signal with quantization make use
of spatial redundancy. DCTs are performed on 8 by 8 blocks, the motion
prediction on 16 by 16 blocks of the luminance signal. Quantization
changes for every 16 by 16 macroblock.
There are three types of coded frames. Intra (``I'') frames are coded
without motion prediction, Predicted (``P'') frames are difference
frames to the last P or I frame. Each macroblock in a P frame can
either come with a vector and difference DCT coefficients for a close
match in the last I or P frame, or it can just be intra coded (like
in the I frames) if there was no good match. Lastly, there are "B"
or bidirectional frames. They are predicted from the closest two I or
P frames, one in the past and one in the future. Those frames are
searched for matching blocks, and three candidates are tried to see
which works best: the forward vector, the backward vector, and the
average of the two blocks from the future and past frames; the chosen
prediction is subtracted from the block being coded. If none of them
works well, the block is intra-coded.
There are 12 frames from I to I, based on random access requirements.
MPEG-1: Informal name of proposed !MPEG (ISO standard DIS 1172).
media source: entity (user and host) that produced the media content.
It is the entity that is shown as the active participant by the
application.
MTU: maximum transmission unit; the largest frame length which may be sent
on a physical medium.
Nevot: network voice terminal; application written by the author.
network source: entity denoted by address and port number from which the !
end system receives the RTP packet and to which the end system sends any
RTP packets for that conference in return.
NTP timestamp: ``NTP timestamps are represented as a 64-bit unsigned
fixed-point number, in seconds relative to 0 hours on 1 January 1900.
The integer part is in the first 32 bits and the fraction part in the
last 32 bits.'' [13] NTP timestamps do not include leap seconds, i.e.,
each and every day contains exactly 86,400 NTP seconds.
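For illustration, converting between Unix time (seconds since 1970) and
this format uses the fixed 1900-to-1970 offset of 2,208,988,800 seconds:

```python
NTP_UNIX_OFFSET = 2208988800  # seconds from 1900-01-01 to 1970-01-01

def unix_to_ntp(unix_seconds):
    """Pack a (non-negative) Unix time into the 64-bit NTP format:
    32-bit integer seconds followed by a 32-bit fraction."""
    seconds = int(unix_seconds) + NTP_UNIX_OFFSET
    fraction = int((unix_seconds - int(unix_seconds)) * 2**32)
    return ((seconds & 0xFFFFFFFF) << 32) | (fraction & 0xFFFFFFFF)

def ntp_to_unix(ntp):
    """Unpack a 64-bit NTP timestamp back to Unix seconds."""
    return (ntp >> 32) - NTP_UNIX_OFFSET + (ntp & 0xFFFFFFFF) / 2**32
```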
NVP: network voice protocol; original packet format used in early packet
voice experiments; defined in [1].
octet: An octet is an 8-bit datum, which may contain values 0 through 255
decimal. Commonly used in ISO and CCITT documents, also known as a
byte.
OSI: Open System Interconnection; a suite of protocols, designed by
ISO committees, to be the international standard computer network
architecture.
out of band: signaling and control information is carried in a separate
channel or separate packets from the actual data. For example, ICMP
carries control information out-of-band, that is, as separate packets,
for IP, but both ICMP and IP usually use the same communication channel
(in band).
parametric coder: coder that encodes parameters of a model representing the
input signal. For example, LPC models a voice source as segments of
voiced and unvoiced speech, represented by filter parameters. Examples
include LPC, CELP and GSM. !waveform coder.
PCM: pulse-code modulation; speech coding where speech is represented by a
given number of fixed-width samples per second. Often used for the
coding employed in the telephone network: 64,000 eight-bit samples per
second.
pel, pixel: picture element. ``Smallest graphic element that can be
independently addressed within a picture; (an alternative term for
raster graphics element).'' (T.411)
playout: Delivery of the medium content to the final consumer within the
receiving host. For audio, this implies digital-to-analog conversion;
for video, display on a screen.
playout unit: A playout unit is a group of packets sharing a common
timestamp. (Naturally, packets whose timestamps are identical due
to timestamp wrap-around are not considered part of the same playout
unit.) For voice, the playout unit would typically be a single voice
segment, while for video a video frame could be broken down into
subframes, each consisting of packets sharing the same timestamp and
ordered by some form of sequence number. !synchronization unit
plesiochronous: ``The essential characteristic of time-scales or signals
such that their corresponding significant instants occur at nominally
the same rate, any variation in rate being constrained within specified
limits. Two signals having the same nominal digit rate, but not
stemming from the same clock or homochronous clocks, are usually
plesiochronous. There is no limit to the time relationship between
corresponding significant instants.'' (G.701, Q.9) In other words,
plesiochronous clocks have (almost) the same rate, but possibly
different phase.
pulse code modulation (PCM): ``A process in which a signal is sampled, and
each sample is quantized independently of other samples and converted
by encoding to a digital signal.'' (G.701)
PVP: packet video protocol; extension of ! NVP to video data [28].
QCIF: quarter common interchange format; format for exchanging video images
with half as many lines and half as many pixels per line as CIF, i.e.,
luminance information is coded at 144 lines and 176 pixels per line.
!CIF, SIF
RTCP: real-time control protocol; adjunct to ! RTP.
RTP: real-time transport protocol; discussed in this memorandum.
sampling rate: ``The number of samples taken of a signal per unit time.''
(G.701)
SB: subband; as in subband codec. Audio or video encoding that splits the
frequency content of a signal into several bands and encodes each band
separately, with the encoding fidelity matched to human perception for
that particular frequency band.
SCIF: standard video interchange format; consists of four !CIF images
arranged in a square. !CIF, QCIF
SIF: standard interchange format; format for exchanging video images of 240
lines with 352 pixels each for NTSC, and 288 lines by 352 pixels for
PAL and SECAM. At the nominal field rates of 60 and 50 fields/s, the
two formats have the same data rate. !CIF, QCIF
slip: In digital communications, slip refers to bit errors caused by the
different clock rates of nominally synchronous sender and receiver. If
the sender clock is faster than the receiver clock, occasionally a bit
will have to be dropped. Conversely, a faster receiver will need to
insert extra bits. The problem also occurs if the clock rates of
encoder and decoder are not matched precisely. Information loss can be
avoided if the duration of pauses (silence periods between talkspurts
or the inter-frame duration) can be adjusted by the receiver. ``The
repetition or deletion of a block of bits in a synchronous or
plesiochronous bit stream due to a discrepancy in the read and write
rates at a buffer.'' (G.810) !G.821, G.822
ST-II: stream protocol; connection-oriented, unreliable, non-sequenced
packet-oriented network and transport protocol with process demulti-
plexing and provisions for establishing flow parameters for resource
control; defined in RFC 1190 [29,30].
Super CIF: video format defined in Annex IV of !H.261 (1992), comprising
704 by 576 pixels.
synchronization unit: A synchronization unit consists of one or more
!playout units that, as a group, share a common fixed delay between
generation and playout of each part of the group. The delay may change
at the beginning of such a synchronization unit. The most common
synchronization units are talkspurts for voice and frames for video
transmission.
TCP: transmission control protocol; an Internet Standard transport layer
protocol defined in RFC 793. It is connection-oriented and
stream-oriented, as opposed to UDP [31].
TPDU: transport protocol data unit.
tightly controlled conference: Participants can join the conference only
after an invitation from a conference moderator. The identity of all
conference participants is known to the moderator. !loosely controlled
conference.
transcoder: device or application that translates between several
encodings, for example between ! LPC and ! PCM.
UDP: user datagram protocol; unreliable, non-sequenced connectionless
transport protocol defined in RFC 768 [32].
vat: visual audio tool written by Steve McCanne and Van Jacobson, Lawrence
Berkeley Laboratory.
vt: voice terminal software written at the Information Sciences Institute.
VMTP: Versatile message transaction protocol; defined in RFC 1045 [33].
waveform coder: a coder that tries to reproduce the waveform after
decompression; examples include PCM and ADPCM for audio, and
discrete-cosine-transform based coders for video; !parametric coder.
Y: Common abbreviation for the luminance or luma signal.
YCbCr: YCbCr coding is employed by D-1 component video equipment.
B Address of Author
Henning Schulzrinne
AT&T Bell Laboratories
MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974-0636
telephone: +1 908 582 2262
facsimile: +1 908 582 5809
electronic mail: hgs@research.att.com
References
[1] D. Cohen, ``A network voice protocol: NVP-II,'' technical report,
University of Southern California/ISI, Marina del Rey, California,
Apr. 1981.
[2] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail
extensions) mechanisms for specifying and describing the format of
internet message bodies,'' Network Working Group Request for Comments
RFC 1341, Bellcore, June 1992.
[3] R. Want, A. Hopper, V. Falcao, and J. Gibbons, ``The active badge
location system,'' ACM Transactions on Information Systems, vol. 10,
pp. 91--102, Jan. 1992.
[4] R. Want and A. Hopper, ``Active badges and personal interactive
computing objects,'' Technical Report ORL 92-2, Olivetti Research,
Cambridge, England, Feb. 1992. also in IEEE Transactions on Consumer
Electronics, Feb. 1992.
[5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable
delay and speech clipping in dynamically managed voice systems,'' IEEE
Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.
[6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and
improvements due to an odd-even sample-interpolation procedure,'' IEEE
Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.
[7] D. Minoli, ``Optimal packet length for packet voice communication,''
IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar.
1979.
[8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial
links,'' Network Working Group Request for Comments RFC 1144, Lawrence
Berkeley Laboratory, Feb. 1990.
[9] P. Francis, ``A near-term architecture for deploying Pip,'' IEEE
Network, vol. 7, pp. 30--37, May 1993.
[10] IMA Digital Audio Focus and Technical Working Groups, ``Recommended
practices for enhancing digital audio compatibility in multimedia
systems,'' tech. rep., Interactive Multimedia Association, Annapolis,
Maryland, Oct. 1992.
[11] W. A. Montgomery, ``Techniques for packet voice synchronization,''
IEEE Journal on Selected Areas in Communications, vol. SAC-1,
pp. 1022--1028, Dec. 1983.
[12] D. Cohen, ``A protocol for packet-switching voice communication,''
Computer Networks, vol. 2, pp. 320--331, September/October 1978.
[13] D. L. Mills, ``Network time protocol (version 3) -- specification,
implementation and analysis,'' Network Working Group Request for
Comments RFC 1305, University of Delaware, Mar. 1992.
[14] ISO/IEC JTC 1, ISO/IEC DIS 11172: Information technology --- coding
of moving pictures and associated audio for digital storage media up
to about 1.5 Mbit/s. International Organization for Standardization
and International Electrotechnical Commission, 1992.
[15] L. Delgrossi, C. Halstrick, R. G. Herrtwich, and H. Stüttgen, ``HeiTP:
a transport protocol for ST-II,'' in Proceedings of the Conference on
Global Communications (GLOBECOM), (Orlando, Florida), pp. 1369--1373
(40.02), IEEE, Dec. 1992.
[16] G. J. Holzmann, Design and Validation of Computer Protocols. Englewood
Cliffs, New Jersey: Prentice Hall, 1991.
[17] A. Nakassis, ``Fletcher's error detection algorithm: how to implement
it efficiently and how to avoid the most common pitfalls,'' ACM
Computer Communication Review, vol. 18, pp. 63--88, Oct. 1988.
[18] J. G. Fletcher, ``An arithmetic checksum for serial transmission,''
IEEE Transactions on Communications, vol. COM-30, pp. 247--252, Jan.
1982.
[19] J. Linn, ``Privacy enhancement for Internet electronic mail: Part III
--- algorithms, modes and identifiers,'' Network Working Group Request
for Comments RFC 1115, IETF, Aug. 1989.
[20] D. Balenson, ``Privacy enhancement for internet electronic mail: Part
III: Algorithms, modes, and identifiers,'' Network Working Group
Request for Comments RFC 1423, IETF, Feb. 1993.
[21] S. Kent, ``Privacy enhancement for internet electronic mail: Part II:
Certificate-based key management,'' Network Working Group Request for
Comments RFC 1422, IETF, Feb. 1993.
[22] J. Linn, ``Privacy enhancement for Internet electronic mail: Part
I --- message encipherment and authentication procedures,'' Network
Working Group Request for Comments RFC 1113, IETF, Aug. 1989.
[23] R. Rivest, ``The MD5 message-digest algorithm,'' Network Working Group
Request for Comments RFC 1321, IETF, Apr. 1992.
[24] North American Directory Forum, ``A naming scheme for c=US,'' Network
Working Group Request for Comments RFC 1255, North American Directory
Forum, Sept. 1991.
[25] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood
Cliffs, New Jersey: Prentice Hall, 1984.
[26] P. T. Brady, ``A model for generating on-off speech patterns in
two-way conversation,'' Bell System Technical Journal, vol. 48,
pp. 2445--2472, Sept. 1969.
[27] J. Postel, ``Internet protocol,'' Network Working Group Request for
Comments RFC 791, Information Sciences Institute, Sept. 1981.
[28] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information
Sciences Institute, University of Southern California, Los Angeles,
California, Aug. 1981.
[29] C. Topolcic, S. Casner, C. Lynn, Jr., P. Park, and K. Schroder,
``Experimental internet stream protocol, version 2 (ST-II),'' Network
Working Group Request for Comments RFC 1190, BBN Systems and
Technologies, Oct. 1990.
[30] C. Topolcic, ``ST II,'' in First International Workshop on Network and
Operating System Support for Digital Audio and Video, no. TR-90-062 in
ICSI Technical Reports, (Berkeley, California), 1990.
[31] J. B. Postel, ``DoD standard transmission control protocol,'' Network
Working Group Request for Comments RFC 761, Information Sciences
Institute, Jan. 1980.
[32] J. B. Postel, ``User datagram protocol,'' Network Working Group
Request for Comments RFC 768, ISI, Aug. 1980.
[33] D. R. Cheriton, ``VMTP: Versatile Message Transaction Protocol
specification,'' in Network Information Center RFC 1045, (Menlo Park,
California), pp. 1--123, SRI International, Feb. 1988.